Assignment: Integrative Genomics Practical Course: UVIC MS Omics Data Analysis-Interactomics
The aim of this practical is to study the impact of age in adrenocortical carcinoma (ACC). To that end, an analysis integrating at least 2 different blocks of omics data is performed following these steps:
SUMMARY OF RESULTS AND CONCLUSIONS
A MultiAssayExperiment object, miniACC, was evaluated to study the impact of age in adrenocortical carcinoma (ACC). Ten patients (5 old and 5 young) were selected from the upper and lower tails of the age distribution (variable years_to_birth), which was inspected graphically and judged to be approximately normal. The selection was made among the samples common to three individual (Ranged)SummarizedExperiments holding mRNA-seq, miRNA-seq, and gene-based GISTIC CNV recurrent lesions. These three data blocks were filtered, TPM-normalized, log-transformed, and scaled; they were then analyzed individually, correlated with one another, and finally evaluated jointly via Multiple Factor Analysis (MFA) and other means.
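The preprocessing chain named above (filter, TPM-normalize, log-transform, scale) can be sketched in base R as follows. This is a minimal illustration on made-up counts, not the code used for the results: the toy matrix, the gene lengths, and the filter thresholds `min_count` and `min_samples` are all assumptions introduced here.

```r
# Hypothetical toy data: 4 genes x 3 samples of raw counts, plus gene lengths in kb.
counts <- matrix(c(10, 0, 5, 100,
                   20, 3, 8, 150,
                   15, 1, 0, 90),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))
gene_length_kb <- c(gene1 = 1.2, gene2 = 0.8, gene3 = 2.5, gene4 = 4.0)

# 1. Filter: keep genes with at least min_count reads in at least min_samples samples.
min_count <- 1
min_samples <- 2
keep <- rowSums(counts >= min_count) >= min_samples
counts_f <- counts[keep, , drop = FALSE]

# 2. TPM: divide by gene length, then rescale every sample to sum to one million.
rate <- counts_f / gene_length_kb[rownames(counts_f)]
tpm <- t(t(rate) / colSums(rate)) * 1e6

# 3. Log-transform with a pseudocount of 1 to avoid log2(0).
log_tpm <- log2(tpm + 1)

# 4. Scale each gene (row) to mean 0 and unit variance across samples.
scaled <- t(scale(t(log_tpm)))
```

After step 2 every column of `tpm` sums to 1e6, which is a quick sanity check that the normalization was applied per sample rather than per gene.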
Regarding mRNA-seq Summarized Experiment:
The ward.D2 hierarchical clustering appeared to reflect the segregation into 5 old and 5 young patients. The log2-ratios of TPM-normalized mRNA-seq count values appeared to be normally distributed. PCA revealed no apparent segregation by age status; the first two principal components, PC1 and PC2, accounted for 25.07% + 19.65% = 44.72% of the variance. A heatmap suggested that old patients underexpressed a larger number of genes. Based on pooled results from limma/voom/edgeR and DESeq2 modeling, the genes deemed differentially expressed with respect to old/young age status were AKT1S1, ASNS, NRAS, MAPK9, ITGA2, ADAR, EGFR, FASN, SERPINE1, TSC2, YBX1, SHC1, TGM2, RAD50, PIK3R1, XBP1, SYK, and CDKN2A. From the DESeq2 model alone, 3 genes were significantly overexpressed (ITGA2, TGM2, ASNS) and 2 were significantly underexpressed (CDKN2A, NRAS). Based on BioMart descriptions and Gene Ontology (GO) analysis, some of the significantly overexpressed protein-coding genes were associated with phagocytosis, endocytosis, and asparagine and glutamine metabolic processes. Chromosomes 1, 5, and 7 carried the most differentially expressed genes. TGM2 is overexpressed in old patients and underexpressed in young patients, whereas CDKN2A is overexpressed in young patients. CDKN2A is aberrantly downregulated in old patient A5LC.
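Whether a clustering "reflects" the age groups can be checked more directly than by eye, by cutting the dendrogram into two groups and cross-tabulating them against age status. The sketch below uses simulated expression values; the matrix `expr` and the patient labels are hypothetical, and only the method (`ward.D2` on Euclidean distances) is taken from the text above.

```r
set.seed(42)

# Hypothetical expression matrix: 20 genes x 10 patients (5 young, 5 old).
age_status <- factor(rep(c("young", "old"), each = 5))
expr <- matrix(rnorm(20 * 10), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("P", 1:10)))

# Patients are in columns, so transpose before computing distances.
d <- dist(t(expr), method = "euclidean")
hc <- hclust(d, method = "ward.D2")

# Cut the tree into two groups and compare them with the age labels.
clusters <- cutree(hc, k = 2)
agreement <- table(cluster = clusters, age = age_status)
print(agreement)
```

A table with most counts on one diagonal would support the claim of segregation by age; an even spread would not. With only ten patients, this remains a descriptive check rather than a test.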
Regarding miRNA-seq Summarized Experiment data block:
The ward.D2 hierarchical clustering did not appear to reflect the segregation into 5 old and 5 young patients. PCA revealed differences between young and old patient samples in dimensions 1 and 2, with the exception of young patients A5J9 and A5JI and old patient A5LC; the first two dimensions captured 28.27% + 18.86% = 47.13% of the total variance. Based on pooled results from limma/voom/edgeR and DESeq2 modeling, the miRNA genes deemed differentially expressed with respect to old/young age status were hsa-mir-153-2, hsa-mir-153-1, hsa-mir-541, hsa-mir-412, hsa-mir-3200, hsa-mir-675, hsa-mir-1248, hsa-mir-9-2, hsa-mir-9-1, hsa-mir-1229, hsa-mir-511-1, hsa-mir-507, hsa-mir-107, hsa-mir-148b, hsa-mir-542, hsa-mir-98, hsa-mir-887, and hsa-mir-9-3. Specifically, based on the heatmap, hsa-mir-511-1 was overexpressed in old patients A5LL and A5L5 and underexpressed in young patients A5LE, A5J9, and A5KV. In contrast, hsa-mir-675 was underexpressed in young patients A5J9, A5JI, A5K0, A5JE, and A5KV and overexpressed in A5LL and A5JF, and slightly in A5LC and A5L5. Based on NCBI and BioMart data, hsa-mir-1229 and hsa-mir-675 are located at 5q35.3 and on chromosome 11, respectively, and hsa-mir-511-1 is located on chromosome 10 at 17845107..17845193. Chromosomes X and 5 have the most (3) significantly differentially expressed miRNA genes. Based on Gene Ontology (GO) analysis, the significantly differentially expressed miRNA genes are associated with regulation of phosphorus metabolism. The targetscan and getMIR approaches were both used to determine the mRNA targets of these miRNA genes. Of all the differentially expressed miRNA genes, only 3 were successfully queried with get_multimir to identify their mRNA targets.
Of all identified targets of these 3, only CDKN1A (target of hsa-miR-1248) and SERBP1 (target of hsa-miR-107) appear distantly related, by gene symbol similarity, to the differentially expressed mRNA-seq genes CDKN2A and SERPINE1. Using the targetscan approach, the expression of miRNA gene hsa-let-7i was found to be significantly correlated with the expression of protein-coding genes CASP3 and GAB2. Unfortunately, the function-based automatic conversion of the miRNA-seq SummarizedExperiment to a RangedSummarizedExperiment split it into ranged and unranged sets, and the result was not used subsequently.
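The percentages of variance quoted for the first two components (for example 28.27% + 18.86% = 47.13% above) follow from the squared standard deviations of the principal components. A minimal sketch with base R `prcomp` on simulated data shows the calculation; the matrix `x` is hypothetical and stands in for a block with patients in rows and features in columns.

```r
set.seed(1)

# Hypothetical block: 10 patients (rows) x 30 features (columns).
x <- matrix(rnorm(10 * 30), nrow = 10)

# PCA on centered, unit-variance features.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Each component's share of the total variance, as a proportion and in percent.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
pct <- round(100 * var_explained, 2)

# Variance captured by the first two components together.
first_two <- sum(pct[1:2])
print(pct[1:2])
print(first_two)
```

With only 10 patients there are at most 9 informative components, so the first two will often capture a large share of the variance even for unstructured data. A figure near 45% is therefore not by itself evidence of biological structure.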
Regarding GISTIC CNV Summarized Experiment:
The ward.D2 hierarchical clustering did not appear to reflect the segregation into 5 old and 5 young patients. PCA likewise showed no segregation by age status for the gene-based GISTIC recurrent region state values; the first two principal components, PC1 and PC2, accounted for 21.26% + 29.4% = 50.66% of the variance. A simple linear regression was fitted for each gene, with the GISTIC CNV value as the dependent variable and the categorical age.status factor (old/young) as the independent variable; the gene with the lowest p-value for differential GISTIC CNV value with respect to age status was FOXO3. The readGistic function was explored to import GISTIC results, either from files provided manually after obtaining them via TCGAutils or from a directory containing all the relevant files. We did not succeed in obtaining the required “all-lesions_CV.txt” file, but we did succeed in graphically depicting GISTIC peak regions via the associated plotting functions. Furthermore, the associated ACC “CNV individual calls” SummarizedExperiment with its assay matrix was successfully downloaded via a query to TCGA and added to our original miniACC MultiAssayExperiment object, but its samples and patients could not be equalized with the other data blocks. Therefore, the CNVRanger package was used to further analyze our gene-wide GISTIC CNV recurrent lesions SummarizedExperiment, under the assumption that the GISTIC data represented the original “individual calls” that are subsequently converted to GISTIC summarized, population-level, gene-based recurrent lesion regions.
Of the 197 CNV regions (cnvrs object), 33 overlapped at least one gene, and the resulting CNVRanger permutation test p-value indicated a significant depletion. The findOverlaps function from the GenomicRanges package, a general function for finding overlaps between two sets of genomic regions, was used to find the protein-coding genes overlapping these 33 summarized CNV regions.
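The gene-by-gene regression of GISTIC value on age status described above can be sketched as below. The data are simulated, the gene names are placeholders, and the Benjamini-Hochberg adjustment is an addition for illustration; the text only reports the lowest raw p-value.

```r
set.seed(7)

# Age status with "young" as the reference level (an assumption of this sketch).
age_status <- factor(rep(c("young", "old"), each = 5), levels = c("young", "old"))

# Hypothetical GISTIC-like states in -2..2 for 6 genes x 10 patients.
cnv <- matrix(sample(-2:2, 6 * 10, replace = TRUE), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("P", 1:10)))

# Fit y ~ age_status for one gene; return the old-vs-young effect and its p-value.
fit_one <- function(y) {
  fit <- lm(y ~ age_status)
  coef(summary(fit))["age_statusold", c("Estimate", "Pr(>|t|)")]
}

res <- as.data.frame(t(apply(cnv, 1, fit_one)))
colnames(res) <- c("estimate", "p_value")

# Adjust for testing many genes at once, then rank by raw p-value.
res$p_adj <- p.adjust(res$p_value, method = "BH")
res <- res[order(res$p_value), ]
print(res)
```

GISTIC states are ordinal rather than continuous, so a linear model is a rough approximation. With 5 patients per group, the gene with the smallest raw p-value may well not survive multiple-testing correction.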
Correlation between CNV and mRNA-seq Data Blocks:
Differential expression of genes in the neighborhood of CNV regions of interest #1, 2, 3, 4, 8, 9, 13, 16, 23, 34, and 35 was illustrated via the CNVRanger function plotEQTL. Furthermore, when correlating raw (unfiltered, non-normalized, non-transformed) mRNA-seq and GISTIC CNV assay data, the following 12 genes were strongly correlated across all patients (young and old combined): ATM, ACVRL1, TSC1, GSK3A, KEAP1, XRCC1, NFKB1, NF2, MYH9, YWHAB, MSH2, and DIABLO. In addition, 44 and 50 genes were significantly correlated across the 5 selected old and the 5 selected young patients, respectively. MFA restricted to the CNV and mRNA-seq data (filtered, TPM-normalized, log-transformed, scaled) showed segregation between old and young patients. In this MFA, the first dimension received its highest contributions from mRNA-seq gene expression (SMAD1, SRC, PIK3R1, PRKAA1, AKT3, NFKB1, MAPK9, AKT1, PRKCA, SQSTM1) and the second dimension from GISTIC CNV gene copy number variation (SRC, TGM2, E2F1, NCOA3, BCL2L1, PRKAA1, YWHAB, PREX1, CDKN1B, ERBB3).
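The per-gene correlation between copy number and expression described above can be sketched as follows. Both matrices are simulated and share placeholder gene and patient names; the choice of Spearman correlation and the Benjamini-Hochberg adjustment are assumptions of this sketch, since the text does not state which were used.

```r
set.seed(3)
genes <- paste0("gene", 1:5)
patients <- paste0("P", 1:10)

# Hypothetical GISTIC-like states, and expression that loosely tracks copy number.
cnv <- matrix(sample(-2:2, 50, replace = TRUE), nrow = 5,
              dimnames = list(genes, patients))
expr <- cnv * 0.8 + matrix(rnorm(50), nrow = 5)

# Correlate the two blocks gene by gene across patients.
# Ties in the CNV states trigger an "exact p-value" warning, which is suppressed.
cor_one <- function(g) {
  ct <- suppressWarnings(cor.test(cnv[g, ], expr[g, ], method = "spearman"))
  c(rho = unname(ct$estimate), p_value = ct$p.value)
}

res <- as.data.frame(t(sapply(genes, cor_one)))
res$p_adj <- p.adjust(res$p_value, method = "BH")
print(res[order(res$p_value), ])
```

The same loop applies to a subset of patients (for example only the old group) by indexing the columns of both matrices. With 5 patients per subgroup, only very strong correlations can reach significance.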
Correlation between mRNA-seq and miRNA-seq Data Blocks:
The expression of miRNA gene hsa-let-7i was found to be significantly correlated with expression of protein-coding genes CASP3 and GAB2.
MFA between mRNA-seq, miRNA-seq, and GISTIC CNV Data Blocks:
The filtered, TPM-normalized, log-transformed, and scaled data of all 3 data blocks (mRNA-seq, miRNA-seq, and GISTIC CNV SummarizedExperiments) were jointly evaluated via MFA. Multiple Factor Analysis helps elucidate the underlying structure of the data by reducing its dimensionality and highlighting the relationships between variables and observations. Based on the MFA eigenvalue summary, the first three dimensions capture 57.77% of the total variance (24.66% (dim 1) + 18.85% (dim 2) + 14.268% (dim 3)). Based on the MFA group summary, the miRNA-seq and mRNA-seq variables jointly contribute the most to the first dimension, whereas the GISTIC CNV recurrent lesions contribute the most to dimension 2 (0.9 vs. 0.009). The top genes driving dimension 1 (from the mRNA-seq block) are SMAD1, SRC, PIK3R1, PRKAA1, AKT3, NFKB1, MAPK9, AKT1, PRKCA, and SQSTM1. The top genes driving dimension 2 (from the GISTIC CNV gene-based recurrent lesions block) are SRC, TGM2, E2F1, NCOA3, BCL2L1, PRKAA1, YWHAB, PREX1, CDKN1B, and ERBB3. The top variables driving dimension 3 are hsa.mir.196a.2, hsa.mir.106b, hsa.mir.196a.1, hsa.mir.25, hsa.mir.16.2, hsa.mir.196b, and hsa.mir.92a.2 (from the miRNA-seq block) and CDK1, FOXM1, and ACACB (from the mRNA-seq block). The MFA shows clear separation between the CNV, mRNA, and miRNA blocks. The individuals analysis, which relates each sample to each dimension, places the ten patients in the multidimensional space; no clear segregation between young and old patient samples is apparent. Of the ten selected patient samples, A5J9 (young), A5JF (old), A5JI (young), A5K0 (old), A5L5 (old), and A5LL (old) have positive values on dimension 1, while A5JE (young), A5KV (young), A5LC (old), and A5LE (young) have negative values on dimension 1.
Young patients TCGA.OR.A5LE, A5J9, and A5JE and old patients A5K0, A5LL, A5JF, and A5LC appear to be outliers, suggesting that the 10 selected patients were not appropriate for this integrative genomics study. The mRNA expression dimension seems to coincide with the age.status condition more than the other 2 data blocks do. Based on the MFA continuous variables analysis, which describes the relationship between the original variables and the extracted dimensions, the mRNA-seq quantitative variables contribute the most to dimension 1 compared with the miRNA-seq and GISTIC CNV recurrent lesions variables.
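The core idea of MFA, balancing the blocks before a joint PCA, can be sketched in base R. Each block is standardized and divided by its first singular value, so that no block dominates the first global axis merely because it has more variables. The block sizes and values below are simulated. The sketch mirrors the standard MFA weighting and is not a substitute for a dedicated MFA function; the numbers reported above come from the analysis itself, not from this code.

```r
set.seed(11)
n <- 10  # patients

# Hypothetical blocks with patients in rows and differing numbers of variables.
blocks <- list(
  mrna  = matrix(rnorm(n * 6), nrow = n),
  mirna = matrix(rnorm(n * 4), nrow = n),
  cnv   = matrix(rnorm(n * 5), nrow = n)
)

# Standardize a block, then divide by its first singular value (MFA weighting).
weight_block <- function(x) {
  z <- scale(x)
  z / svd(z)$d[1]
}
weighted <- lapply(blocks, weight_block)

# Global PCA on the concatenated, already centered and weighted blocks.
global <- prcomp(do.call(cbind, weighted), center = FALSE, scale. = FALSE)
pct_var <- 100 * global$sdev^2 / sum(global$sdev^2)

# Contribution of each block to dimension 1: sum of squared loadings, in percent.
block_id <- rep(names(blocks), times = vapply(blocks, ncol, integer(1)))
contrib_dim1 <- 100 * tapply(global$rotation[, 1]^2, block_id, sum)
print(round(pct_var[1:3], 2))
print(round(contrib_dim1, 2))
```

Block contributions computed this way sum to 100% per dimension, which is the kind of group-level comparison made in the summary above.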
EXPLORATION OF MULTIASSAY EXPERIMENT, SELECTION OF 10 YOUNG/OLD PATIENTS, EQUALIZATION OF PATIENTS/SAMPLES, AND SEPARATION OF SUMMARIZED EXPERIMENTS
#EXPLORE miniACC MultiAssayExperiment:
data(miniACC)
class(miniACC)
## [1] "MultiAssayExperiment"
## attr(,"package")
## [1] "MultiAssayExperiment"
miniACC
## A MultiAssayExperiment object of 5 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 5:
## [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 79 columns
## [2] gistict: SummarizedExperiment with 198 rows and 90 columns
## [3] RPPAArray: SummarizedExperiment with 33 rows and 46 columns
## [4] Mutations: matrix with 97 rows and 90 columns
## [5] miRNASeqGene: SummarizedExperiment with 471 rows and 80 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
#RNASeq2GeneNorm
#RNA-seq count data: an ExpressionSet with 198 rows and 79 columns
#gistict
#Recurrent copy number lesions identified by GISTIC2: a SummarizedExperiment with 198 rows and 90 columns
#RPPAArray
#Reverse Phase Protein Array: an ExpressionSet with 33 rows and 46 columns. Rows are indexed by genes, but protein annotations are available from featureData(miniACC[["RPPAArray"]]). The source of these annotations is noted in abstract(miniACC[["RPPAArray"]])
#Mutations
#Somatic mutations: a matrix with 97 rows and 90 columns (as printed above). 1 for any kind of non-silent mutation, zero for silent (synonymous) or no mutation.
#miRNASeqGene
#microRNA sequencing: an ExpressionSet with 471 rows and 80 columns. Rows not having at least 5 counts in at least 5 samples were removed.
#This dataset provides five assays on 92 patients, although not all five assays were performed for every patient:
upsetSamples(miniACC)
#This graph depicts the overlapping patients for all 5 assays
colData(miniACC)
## DataFrame with 92 rows and 30 columns
## patientID years_to_birth vital_status days_to_death
## <character> <integer> <integer> <integer>
## TCGA-OR-A5J1 TCGA-OR-A5J1 58 1 1355
## TCGA-OR-A5J2 TCGA-OR-A5J2 44 1 1677
## TCGA-OR-A5J3 TCGA-OR-A5J3 23 0 NA
## TCGA-OR-A5J4 TCGA-OR-A5J4 23 1 423
## TCGA-OR-A5J5 TCGA-OR-A5J5 30 1 365
## ... ... ... ... ...
## TCGA-PK-A5H9 TCGA-PK-A5H9 27 0 NA
## TCGA-PK-A5HA TCGA-PK-A5HA 63 0 NA
## TCGA-PK-A5HB TCGA-PK-A5HB 63 0 NA
## TCGA-PK-A5HC TCGA-PK-A5HC 44 0 NA
## TCGA-P6-A5OG TCGA-P6-A5OG 45 1 383
## days_to_last_followup tumor_tissue_site pathologic_stage
## <integer> <character> <character>
## TCGA-OR-A5J1 NA adrenal stage ii
## TCGA-OR-A5J2 NA adrenal stage iv
## TCGA-OR-A5J3 2091 adrenal stage iii
## TCGA-OR-A5J4 NA adrenal stage iv
## TCGA-OR-A5J5 NA adrenal stage iii
## ... ... ... ...
## TCGA-PK-A5H9 616 adrenal stage ii
## TCGA-PK-A5HA 1201 adrenal stage i
## TCGA-PK-A5HB 1293 adrenal NA
## TCGA-PK-A5HC 679 adrenal stage iii
## TCGA-P6-A5OG NA adrenal stage iv
## pathology_T_stage pathology_N_stage gender
## <character> <character> <character>
## TCGA-OR-A5J1 t2 n0 male
## TCGA-OR-A5J2 t3 n0 female
## TCGA-OR-A5J3 t3 n0 female
## TCGA-OR-A5J4 t3 n1 female
## TCGA-OR-A5J5 t4 n0 male
## ... ... ... ...
## TCGA-PK-A5H9 t2 n0 female
## TCGA-PK-A5HA t1 n0 male
## TCGA-PK-A5HB NA NA male
## TCGA-PK-A5HC t4 n0 female
## TCGA-P6-A5OG t4 n0 female
## date_of_initial_pathologic_diagnosis radiation_therapy
## <integer> <character>
## TCGA-OR-A5J1 2000 no
## TCGA-OR-A5J2 2004 no
## TCGA-OR-A5J3 2008 no
## TCGA-OR-A5J4 2000 no
## TCGA-OR-A5J5 2000 no
## ... ... ...
## TCGA-PK-A5H9 2012 no
## TCGA-PK-A5HA 2011 no
## TCGA-PK-A5HB 2003 yes
## TCGA-PK-A5HC 2011 no
## TCGA-P6-A5OG 2011 no
## histological_type residual_tumor number_of_lymph_nodes
## <character> <character> <integer>
## TCGA-OR-A5J1 adrenocortical carci.. r0 NA
## TCGA-OR-A5J2 adrenocortical carci.. r2 0
## TCGA-OR-A5J3 adrenocortical carci.. r0 0
## TCGA-OR-A5J4 adrenocortical carci.. r2 2
## TCGA-OR-A5J5 adrenocortical carci.. r2 NA
## ... ... ... ...
## TCGA-PK-A5H9 adrenocortical carci.. r0 NA
## TCGA-PK-A5HA adrenocortical carci.. r0 0
## TCGA-PK-A5HB adrenocortical carci.. NA NA
## TCGA-PK-A5HC adrenocortical carci.. r1 0
## TCGA-P6-A5OG adrenocortical carci.. r2 0
## race ethnicity Histology C1A.C1B
## <character> <character> <character> <character>
## TCGA-OR-A5J1 white NA Usual Type C1A
## TCGA-OR-A5J2 white hispanic or latino Usual Type C1A
## TCGA-OR-A5J3 white hispanic or latino Usual Type C1A
## TCGA-OR-A5J4 white hispanic or latino Usual Type NA
## TCGA-OR-A5J5 white hispanic or latino Usual Type C1A
## ... ... ... ... ...
## TCGA-PK-A5H9 asian not hispanic or latino Usual Type C1B
## TCGA-PK-A5HA NA NA Usual Type C1B
## TCGA-PK-A5HB NA NA Usual Type C1A
## TCGA-PK-A5HC asian not hispanic or latino Usual Type NA
## TCGA-P6-A5OG white not hispanic or latino NA NA
## mRNA_K4 MethyLevel miRNA.cluster
## <character> <character> <character>
## TCGA-OR-A5J1 steroid-phenotype-hi.. CIMP-high miRNA_1
## TCGA-OR-A5J2 steroid-phenotype-hi.. CIMP-low miRNA_1
## TCGA-OR-A5J3 steroid-phenotype-high CIMP-intermediate miRNA_6
## TCGA-OR-A5J4 NA CIMP-high miRNA_6
## TCGA-OR-A5J5 steroid-phenotype-high CIMP-intermediate miRNA_2
## ... ... ... ...
## TCGA-PK-A5H9 steroid-phenotype-low CIMP-low miRNA_5
## TCGA-PK-A5HA steroid-phenotype-low CIMP-high miRNA_5
## TCGA-PK-A5HB steroid-phenotype-high CIMP-high miRNA_6
## TCGA-PK-A5HC NA NA NA
## TCGA-P6-A5OG NA NA NA
## SCNA.cluster protein.cluster COC OncoSign purity
## <character> <integer> <character> <character> <numeric>
## TCGA-OR-A5J1 Quiet NA COC3 CN2 0.90
## TCGA-OR-A5J2 Noisy 1 COC3 TP53/NF1 0.89
## TCGA-OR-A5J3 Chromosomal 3 COC2 CN2 0.93
## TCGA-OR-A5J4 Chromosomal NA NA CN1 0.87
## TCGA-OR-A5J5 Chromosomal NA COC2 TP53/NF1 0.93
## ... ... ... ... ... ...
## TCGA-PK-A5H9 Quiet 3 COC1 TP53/NF1 0.79
## TCGA-PK-A5HA Chromosomal 2 COC1 CN2 0.83
## TCGA-PK-A5HB Noisy NA COC3 TP53/NF1 0.88
## TCGA-PK-A5HC Chromosomal NA NA TP53/NF1 0.59
## TCGA-P6-A5OG NA NA NA NA NA
## ploidy genome_doublings ADS
## <numeric> <integer> <numeric>
## TCGA-OR-A5J1 1.95 0 -0.08
## TCGA-OR-A5J2 1.31 0 -0.84
## TCGA-OR-A5J3 1.25 0 1.18
## TCGA-OR-A5J4 2.60 1 NA
## TCGA-OR-A5J5 2.75 1 -1.00
## ... ... ... ...
## TCGA-PK-A5H9 2.00 0 -0.85
## TCGA-PK-A5HA 1.69 0 -1.49
## TCGA-PK-A5HB 1.64 0 -0.31
## TCGA-PK-A5HC 2.53 1 NA
## TCGA-P6-A5OG NA NA NA
#getClinicalNames(miniACC)
#Subset the MultiAssayExperiment to only include the three assays RNASeq2GeneNorm, gistict, and miRNASeqGene SummarizedExperiment
#multiassayexperiment[i = rownames, j = primary or colnames, k = assay]
miniACC.assays<-miniACC[, , c("RNASeq2GeneNorm", "gistict", "miRNASeqGene")]
## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 136 sampleMap rows not in names(experiments)
#complete.cases() shows which patients have complete data for all assays:
summary(complete.cases(miniACC.assays))
## Mode FALSE TRUE
## logical 15 77
#Subset MultiAssayExperiment to Obtain common samples
miniACC.assays.comp<-miniACC.assays[, complete.cases(miniACC.assays), ]
#complete.cases() shows which patients have complete data for all assays:
summary(complete.cases(miniACC.assays.comp))
## Mode TRUE
## logical 77
colData(miniACC.assays.comp)$patientID
## [1] "TCGA-OR-A5J1" "TCGA-OR-A5J2" "TCGA-OR-A5J3" "TCGA-OR-A5J5" "TCGA-OR-A5J6"
## [6] "TCGA-OR-A5J7" "TCGA-OR-A5J8" "TCGA-OR-A5J9" "TCGA-OR-A5JA" "TCGA-OR-A5JB"
## [11] "TCGA-OR-A5JC" "TCGA-OR-A5JD" "TCGA-OR-A5JE" "TCGA-OR-A5JF" "TCGA-OR-A5JG"
## [16] "TCGA-OR-A5JI" "TCGA-OR-A5JJ" "TCGA-OR-A5JK" "TCGA-OR-A5JL" "TCGA-OR-A5JM"
## [21] "TCGA-OR-A5JO" "TCGA-OR-A5JP" "TCGA-OR-A5JQ" "TCGA-OR-A5JR" "TCGA-OR-A5JS"
## [26] "TCGA-OR-A5JT" "TCGA-OR-A5JV" "TCGA-OR-A5JW" "TCGA-OR-A5JX" "TCGA-OR-A5JY"
## [31] "TCGA-OR-A5JZ" "TCGA-OR-A5K0" "TCGA-OR-A5K1" "TCGA-OR-A5K2" "TCGA-OR-A5K3"
## [36] "TCGA-OR-A5K4" "TCGA-OR-A5K5" "TCGA-OR-A5K6" "TCGA-OR-A5K8" "TCGA-OR-A5K9"
## [41] "TCGA-OR-A5KO" "TCGA-OR-A5KT" "TCGA-OR-A5KU" "TCGA-OR-A5KV" "TCGA-OR-A5KW"
## [46] "TCGA-OR-A5KX" "TCGA-OR-A5KY" "TCGA-OR-A5KZ" "TCGA-OR-A5L3" "TCGA-OR-A5L4"
## [51] "TCGA-OR-A5L5" "TCGA-OR-A5L6" "TCGA-OR-A5L8" "TCGA-OR-A5L9" "TCGA-OR-A5LA"
## [56] "TCGA-OR-A5LB" "TCGA-OR-A5LC" "TCGA-OR-A5LD" "TCGA-OR-A5LE" "TCGA-OR-A5LG"
## [61] "TCGA-OR-A5LH" "TCGA-OR-A5LJ" "TCGA-OR-A5LK" "TCGA-OR-A5LL" "TCGA-OR-A5LM"
## [66] "TCGA-OR-A5LN" "TCGA-OR-A5LO" "TCGA-OR-A5LP" "TCGA-OR-A5LR" "TCGA-OR-A5LS"
## [71] "TCGA-OR-A5LT" "TCGA-OU-A5PI" "TCGA-PA-A5YG" "TCGA-PK-A5H9" "TCGA-PK-A5HA"
## [76] "TCGA-PK-A5HB" "TCGA-P6-A5OG"
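What the complete-cases subsetting does can be illustrated without the MultiAssayExperiment machinery: a patient is "complete" when it appears in every assay. The patient IDs and their assignment to assays below are hypothetical; only the assay names are taken from the object above.

```r
# Hypothetical patient IDs observed in each of the three assays.
assay_patients <- list(
  RNASeq2GeneNorm = c("P1", "P2", "P3", "P5"),
  gistict         = c("P1", "P2", "P3", "P4", "P5"),
  miRNASeqGene    = c("P2", "P3", "P5", "P6")
)

# Presence/absence table: one row per patient, one column per assay.
all_patients <- sort(unique(unlist(assay_patients)))
presence <- sapply(assay_patients, function(ids) all_patients %in% ids)
rownames(presence) <- all_patients

# A complete case is a patient present in every assay.
complete <- rowSums(presence) == ncol(presence)
common <- all_patients[complete]
print(presence)
print(common)

# The same set is obtained by intersecting the ID vectors directly.
same_as_intersect <- identical(common, Reduce(intersect, assay_patients))
```

In this toy example P2, P3, and P5 are the complete cases. In the real object, 77 of the 92 patients are complete for the three selected assays, as reported by `summary(complete.cases(miniACC.assays))` above.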
#More simply, intersectColumns() will select complete cases and rearrange each ExperimentList element
#so its columns correspond exactly to rows of colData in the same order:
#miniACC.assays.comp=intersectColumns(miniACC.assays)
#The column names of the assays in miniACC.assays.comp are not the same because of assay-specific identifiers,
#but they have been automatically rearranged to correspond to the same patients. In these TCGA assays,
#the first three hyphen-delimited fields identify the patient, i.e. the first patient is TCGA-OR-A5J1:
colnames(miniACC.assays.comp)
## CharacterList of length 3
## [["RNASeq2GeneNorm"]] TCGA-OR-A5J1-01A-11R-A29S-07 ...
## [["gistict"]] TCGA-OR-A5J1-01A-11D-A29H-01 ... TCGA-P6-A5OG-01A-22D-A29H-01
## [["miRNASeqGene"]] TCGA-OR-A5J1-01A-11R-A29W-13 ...
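The patient and sample-type parts of a TCGA barcode can be extracted with plain string operations, which is what the later `substr()` calls rely on. The two barcodes below are taken from the output above; the variable names are illustrative.

```r
# Two barcodes as printed above (an RNA-seq aliquot and a GISTIC aliquot).
barcodes <- c("TCGA-OR-A5J1-01A-11R-A29S-07",
              "TCGA-P6-A5OG-01A-22D-A29H-01")

# Patient ID: the first three hyphen-delimited fields, i.e. the first 12 characters.
patient_id <- substr(barcodes, 1, 12)

# Sample type code: characters 14-15 ("01" = Primary Solid Tumor).
sample_code <- substr(barcodes, 14, 15)

# Same patient ID, derived by splitting on hyphens instead of fixed positions.
fields <- strsplit(barcodes, "-", fixed = TRUE)
patient_id_split <- vapply(fields, function(f) paste(f[1:3], collapse = "-"),
                           character(1))

print(data.frame(barcode = barcodes, patient_id, sample_code))
```

Splitting on hyphens is slightly more robust than fixed character positions if an identifier ever deviates from the usual field widths; for these barcodes both approaches agree.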
#intersectRows() keeps only rows that are common to each assay, and aligns them in identical order
#miniACC.assays.comp2 <- intersectRows(miniACC.assays.comp[, , c("RNASeq2GeneNorm","gistict","miRNASeqGene")])
rownames(miniACC.assays.comp)
## CharacterList of length 3
## [["RNASeq2GeneNorm"]] DIRAS3 MAPK14 YAP1 CDKN1B ... CHGA IDH3A SQSTM1 KCNJ13
## [["gistict"]] DIRAS3 MAPK14 YAP1 CDKN1B ERBB2 ... CHGA IDH3A SQSTM1 KCNJ13
## [["miRNASeqGene"]] hsa-let-7a-1 hsa-let-7a-2 ... hsa-mir-99a hsa-mir-99b
#Obtain age variable and study its frequency on the common samples. We will take variable years_to_birth
years_to_birth <- colData(miniACC.assays.comp)$years_to_birth
table(years_to_birth )
## years_to_birth
## 14 17 22 23 25 26 27 29 30 31 32 34 36 37 39 40 42 44 45 46 47 48 49 50 51 52
## 1 2 2 3 2 2 1 1 3 1 1 1 3 3 2 1 1 2 2 1 1 2 1 1 1 3
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 71 75 77
## 4 1 2 1 2 1 2 2 3 1 2 1 3 1 1 1 2 1 1 1
# plotting integer vector
#Barplot of patient age: tabulate first so that bar heights are counts per age
barplot(table(years_to_birth), xlab = "Patient Age", ylab = "Count", col = "white", col.axis = "darkgreen", col.lab = "darkgreen")
hist(years_to_birth, main = "Histogram of Patient Age",xlab = "Values",col.lab = "darkgreen",col.main = "darkgreen")
#Plot the histogram and overlay the density
hist(years_to_birth, freq = FALSE)
lines(density(years_to_birth))
#The histogram and density overlay suggest a unimodal, approximately normal distribution rather than a bimodal one
#We use fitdistrplus package that provides tools for distribution fitting.
descdist(years_to_birth, discrete = FALSE)
## summary statistics
## ------
## min: 14 max: 77
## median: 49
## mean: 46.64935
## estimated sd: 15.94049
## estimated skewness: -0.2132373
## estimated kurtosis: 2.004782
#Now we attempt to fit different distributions:
normal_dist <- fitdist(years_to_birth, "norm")
#and inspect the fit:
plot(normal_dist)
#For comparison, we also fit a binomial distribution:
binomial_dist <- fitdist(years_to_birth, "binom", fix.arg=list(size=77), start=list(prob=0.3))
#and inspect the fit:
plot(binomial_dist)
#We determine that years_to_birth follows a normal distribution
#The mean and SD are appropriate if the variable is somewhat symmetric. However, they can be misleading
#if the data are skewed (non-symmetric distribution) or there are outliers.
#The median and IQR can be used with any variable, but are typically used as an alternative to the mean
#and SD when the variable is skewed (not symmetric) or there are outliers since they are robust to skew and outliers.
#“Outliers” are values that are far away from the bulk of the values.
#Using the following functions to compute these statistics and study the continuous variable :
is.na (years_to_birth)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE
sum(is.na(years_to_birth)) # Number of missing values
## [1] 0
mean(years_to_birth, na.rm = T)
## [1] 46.64935
sd(years_to_birth, na.rm = T)
## [1] 15.94049
min(years_to_birth, na.rm = T)
## [1] 14
max(years_to_birth, na.rm = T)
## [1] 77
median(years_to_birth, na.rm = T)
## [1] 49
IQR(years_to_birth, na.rm = T)
## [1] 26
quantile(years_to_birth, probs = c(0,0.25,0.5,0.75,1))
## 0% 25% 50% 75% 100%
## 14 34 49 60 77
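The summary statistics above can be reproduced without the original object by rebuilding the age vector from the frequency table printed earlier (values and counts copied from the `table(years_to_birth)` output). The skewness and kurtosis below are plain moment estimators, so they may differ slightly from the "estimated" values reported by `descdist`; patient order is not preserved by this reconstruction.

```r
# Ages and their frequencies, copied from the table(years_to_birth) output.
age_values <- c(14, 17, 22, 23, 25, 26, 27, 29, 30, 31, 32, 34, 36, 37, 39, 40,
                42, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
                59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 71, 75, 77)
age_counts <- c(1, 2, 2, 3, 2, 2, 1, 1, 3, 1, 1, 1, 3, 3, 2, 1,
                1, 2, 2, 1, 1, 2, 1, 1, 1, 3, 4, 1, 2, 1, 2, 1,
                2, 2, 3, 1, 2, 1, 3, 1, 1, 1, 2, 1, 1, 1)
ages <- rep(age_values, times = age_counts)

# Location and spread.
age_mean <- mean(ages)
age_median <- median(ages)
age_iqr <- IQR(ages)

# Moment-based shape measures: 0 skewness and kurtosis 3 for a normal distribution.
m2 <- mean((ages - age_mean)^2)
skew <- mean((ages - age_mean)^3) / m2^1.5
kurt <- mean((ages - age_mean)^4) / m2^2

print(c(mean = age_mean, median = age_median, IQR = age_iqr,
        skewness = skew, kurtosis = kurt))
```

A slightly negative skewness and a kurtosis below 3 describe a distribution that is roughly symmetric but flatter than a normal curve, so "approximately normal" is the most that these summaries support.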
#df %>%
# group_by(n < 0) %>%
# top_n(2, abs(n)) %>%
# ungroup()
length(years_to_birth)
## [1] 77
#Extracting lowest 5 ages and highest 5 ages (low and high tails of normal distribution). Evaluating young patient ages in distribution
sort(years_to_birth)[1:5]
## [1] 14 17 17 22 22
#These five youngest patients span three distinct ages (14, 17, 22), and the table above shows no other patients with those ages. Therefore:
young <- sort(years_to_birth)[1:5]
young
## [1] 14 17 17 22 22
#Evaluating old patient ages in distribution
old <- sort(years_to_birth, decreasing = TRUE)[1:5]
old
## [1] 77 75 71 69 69
#Now subset multi-assay experiment to only include those corresponding patients with selected age
combined.age<-c(young, old)
combined.age
## [1] 14 17 17 22 22 77 75 71 69 69
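Selecting patients by matching age values works here because no other patient shares an age with the ten selected ones: per the frequency table, ages 14, 17, and 22 cover 1 + 2 + 2 = 5 patients and ages 69, 71, 75, and 77 cover 2 + 1 + 1 + 1 = 5. If a tie straddled the cut-off, matching by value would select too many patients. The toy cohort below is hypothetical and shows both the pitfall and an index-based alternative using `order()`.

```r
# Hypothetical cohort in which three patients share the cut-off age of 22.
cohort <- data.frame(
  patient = paste0("P", 1:12),
  age = c(14, 17, 17, 22, 22, 22, 40, 69, 69, 71, 75, 77)
)
k <- 5

# Value-based selection: every patient whose age matches a tail value is picked.
tail_values <- c(sort(cohort$age)[1:k],
                 sort(cohort$age, decreasing = TRUE)[1:k])
by_value <- cohort$age %in% tail_values
n_by_value <- sum(by_value)   # 11 here, not 10, because of the tie at age 22

# Index-based selection: exactly k youngest and k oldest rows.
ord <- order(cohort$age)
young_idx <- head(ord, k)
old_idx <- tail(ord, k)
selected <- cohort[c(young_idx, old_idx), ]
print(n_by_value)
print(selected)
```

With ties at the boundary, the index-based choice among tied patients is arbitrary (here, by row order), so either way it is worth checking the frequency table before fixing the group size.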
#Subsetting according to age of young and old patients
#multiassayexperiment[i = rownames, j = primary or colnames, k = assay]
selected.age <- miniACC.assays.comp$years_to_birth %in% combined.age
selected.age
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [13] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [61] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE
miniACC.assays.comp.age<-miniACC.assays.comp[, miniACC.assays.comp$years_to_birth %in% combined.age , ]
#Remove NA values from vector
#miniACC.comp.age.na<-miniACC.comp.age[, !is.na(miniACC.comp.age$years_to_birth %in% combined.age), ]
#Obtain common samples
#miniACC.sub.compmatch.age.na <- miniACC.sub.compmatch.age[, complete.cases(miniACC.sub.compmatch.age), ]
#miniACC.sub.compmatch.age
colData(miniACC.assays.comp.age)$patientID
## [1] "TCGA-OR-A5J9" "TCGA-OR-A5JE" "TCGA-OR-A5JF" "TCGA-OR-A5JI" "TCGA-OR-A5K0"
## [6] "TCGA-OR-A5KV" "TCGA-OR-A5L5" "TCGA-OR-A5LC" "TCGA-OR-A5LE" "TCGA-OR-A5LL"
#head(str(miniACC.assays.comp.age))
#Confirm dimensions and that correct indexes were used in extraction:
experiments(miniACC.assays.comp.age)
## ExperimentList class object of length 3:
## [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 10 columns
## [2] gistict: SummarizedExperiment with 198 rows and 10 columns
## [3] miRNASeqGene: SummarizedExperiment with 471 rows and 10 columns
sampleMap(miniACC.assays.comp.age)
## DataFrame with 30 rows and 3 columns
## assay primary colname
## <factor> <character> <character>
## 1 RNASeq2GeneNorm TCGA-OR-A5J9 TCGA-OR-A5J9-01A-11R..
## 2 RNASeq2GeneNorm TCGA-OR-A5JE TCGA-OR-A5JE-01A-11R..
## 3 RNASeq2GeneNorm TCGA-OR-A5JF TCGA-OR-A5JF-01A-11R..
## 4 RNASeq2GeneNorm TCGA-OR-A5JI TCGA-OR-A5JI-01A-11R..
## 5 RNASeq2GeneNorm TCGA-OR-A5K0 TCGA-OR-A5K0-01A-11R..
## ... ... ... ...
## 26 miRNASeqGene TCGA-OR-A5KV TCGA-OR-A5KV-01A-11R..
## 27 miRNASeqGene TCGA-OR-A5L5 TCGA-OR-A5L5-01A-11R..
## 28 miRNASeqGene TCGA-OR-A5LC TCGA-OR-A5LC-01A-11R..
## 29 miRNASeqGene TCGA-OR-A5LE TCGA-OR-A5LE-01A-11R..
## 30 miRNASeqGene TCGA-OR-A5LL TCGA-OR-A5LL-01A-11R..
metadata(miniACC.assays.comp.age)
## $title
## [1] "Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma"
##
## $PMID
## [1] "27165744"
##
## $sourceURL
## [1] "http://s3.amazonaws.com/multiassayexperiments/accMAEO.rds"
##
## $RPPAfeatureDataURL
## [1] "http://genomeportal.stanford.edu/pan-tcga/show_target_selection_file?filename=Allprotein.txt"
##
## $colDataExtrasURL
## [1] "http://www.cell.com/cms/attachment/2062093088/2063584534/mmc3.xlsx"
#Subset each omics data block (checking object class and data type). We extract each complete SummarizedExperiment we are interested in for separate,
#individual evaluation and for determining if samples are aligned
mACC.exp3 <- miniACC.assays.comp.age[[1]] #SummarizedExperiment
mACC.CN3 <- miniACC.assays.comp.age[[2]] #SummarizedExperiment
mACC.mir3 <- miniACC.assays.comp.age[[3]] #SummarizedExperiment
#data types
range(assay(mACC.exp3))
## [1] 0.0 206162.3
table(assay(mACC.CN3))
##
## -2 -1 0 1 2
## 3 336 1066 565 10
range(assay(mACC.mir3))
## [1] 0 2753979
rowData(mACC.exp3)
## DataFrame with 198 rows and 0 columns
metadata(mACC.exp3)
## $experimentData
## Experiment data
## Experimenter name:
## Laboratory:
## Contact information:
## Title:
## URL:
## PMIDs:
## No abstract available.
##
## $annotation
## character(0)
##
## $protocolData
## An object of class 'AnnotatedDataFrame': none
#Need to make sure that we have the same SAMPLES
s.exp3 <- substr(colnames(mACC.exp3),1,15)
s.CN3 <- substr(colnames(mACC.CN3),1,15)
s.mir3 <- substr(colnames(mACC.mir3),1,15)
s.common3 <- intersect(intersect(s.exp3,s.CN3),s.mir3)
TCGAutils::sampleTables(miniACC.assays.comp.age)
## $RNASeq2GeneNorm
##
## 01
## 10
##
## $gistict
##
## 01
## 10
##
## $miRNASeqGene
##
## 01
## 10
data(sampleTypes, package="TCGAutils")
sampleTypes
## Code Definition Short.Letter.Code
## 1 01 Primary Solid Tumor TP
## 2 02 Recurrent Solid Tumor TR
## 3 03 Primary Blood Derived Cancer - Peripheral Blood TB
## 4 04 Recurrent Blood Derived Cancer - Bone Marrow TRBM
## 5 05 Additional - New Primary TAP
## 6 06 Metastatic TM
## 7 07 Additional Metastatic TAM
## 8 08 Human Tumor Original Cells THOC
## 9 09 Primary Blood Derived Cancer - Bone Marrow TBM
## 10 10 Blood Derived Normal NB
## 11 11 Solid Tissue Normal NT
## 12 12 Buccal Cell Normal NBC
## 13 13 EBV Immortalized Normal NEBV
## 14 14 Bone Marrow Normal NBM
## 15 15 sample type 15 15SH
## 16 16 sample type 16 16SH
## 17 20 Control Analyte CELLC
## 18 40 Recurrent Blood Derived Cancer - Peripheral Blood TRB
## 19 50 Cell Lines CELL
## 20 60 Primary Xenograft Tissue XP
## 21 61 Cell Line Derived Xenograft Tissue XCL
## 22 99 sample type 99 99SH
#Select 01=Primary Solid tumor
#All samples are tumoral TP
mACC.exp.m3 <- assay(mACC.exp3)
mACC.exp.c3 <- mACC.exp.m3[,grep(paste(s.common3,collapse="|"),colnames(mACC.exp.m3),value = T)]
mACC.CN.m3 <- assay(mACC.CN3)
mACC.CN.c3 <- mACC.CN.m3[,grep(paste(s.common3,collapse="|"),colnames(mACC.CN.m3),value = T)]
mACC.mir.m3 <- assay(mACC.mir3)
mACC.mir.c3 <- mACC.mir.m3[,grep(paste(s.common3,collapse="|"),colnames(mACC.mir.m3),value = T)]
#check order and years_to_birth variable
cd3 <- colData(miniACC.assays.comp.age)
all.equal(rownames(cd3),substr(colnames(mACC.exp.c3),1,12))
## [1] TRUE
all.equal(rownames(cd3),substr(colnames(mACC.CN.c3),1,12))
## [1] TRUE
all.equal(rownames(cd3),substr(colnames(mACC.mir.c3),1,12))
## [1] TRUE
#ALL TRUE
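The `grep()` calls above keep matching columns in their existing order, so the `all.equal()` checks are what confirm the alignment with colData. If a matrix ever arrived in a different order, `match()` could reorder its columns explicitly. The sketch below uses made-up barcodes (the `TCGA-XX-...` identifiers are placeholders, not real samples).

```r
# Patient order as it would appear in the rows of colData (hypothetical IDs).
patients <- c("TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0003")

# Hypothetical assay matrix whose columns are in a different order.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("g1", "g2"),
                            c("TCGA-XX-0003-01A-11R-0000-07",
                              "TCGA-XX-0001-01A-11R-0000-07",
                              "TCGA-XX-0002-01A-11R-0000-07")))

# Position of each patient among the matrix columns; stop if any is missing.
idx <- match(patients, substr(colnames(m), 1, 12))
stopifnot(!anyNA(idx))

# Reorder columns so that they follow the colData patient order.
aligned <- m[, idx, drop = FALSE]
is_aligned <- identical(substr(colnames(aligned), 1, 12), patients)
print(colnames(aligned))
```

Applying the same `match()` step to each block guarantees that column i refers to the same patient in every matrix, which the MFA and the cross-block correlations both assume.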
# GLOBAL MFA variables
exp.l3<-nrow(mACC.exp.c3)
cn.l3<-nrow(mACC.CN.c3)
mir.l3<-nrow(mACC.mir.c3)
#Convert integer vector into factor with 2 levels (old, young) based on condition
colData(miniACC.assays.comp.age)$years_to_birth <- factor(ifelse(colData(miniACC.assays.comp.age)$years_to_birth>=68, "old", "young"))
table(colData(miniACC.assays.comp.age)$years_to_birth)
##
## old young
## 5 5
cond2<-colData(miniACC.assays.comp.age)$years_to_birth
cond2
## [1] young young old young old young old old young old
## Levels: old young
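An alternative to overwriting `years_to_birth` is to store the old/young label in a separate column, which keeps the numeric age available for later steps and for merging with other clinical tables. The sketch below uses a plain data frame with made-up patients. The column name `age_status` and the choice of "young" as reference level are assumptions of this sketch; the code above uses the default alphabetical levels (old, young).

```r
# Hypothetical clinical table with the same cut-off as above (>= 68 is "old").
cd <- data.frame(patientID = paste0("P", 1:4),
                 years_to_birth = c(14L, 77L, 22L, 69L))

# Keep the integer age and add the label as a new factor column.
cd$age_status <- factor(ifelse(cd$years_to_birth >= 68, "old", "young"),
                        levels = c("young", "old"))
print(cd)

# Why overwriting is fragile: assigning an integer age into an old/young factor
# is not a valid level, so R stores NA and warns "invalid factor level".
recoded <- factor(c("old", "young"))
after_merge <- suppressWarnings(replace(recoded, 2, 58L))
print(after_merge)
```

With an explicit reference level, model coefficients and fold changes read as "old relative to young", which is usually the more natural direction for an ageing question.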
#Will later Confirm same patient ID and sample order
##############################################################################################################################################
#TO LATER IMPLEMENT THE CNVRanger function eqtl FOR CO-mRNA/CNV ANALYSIS, we require the initial INDIVIDUAL CNV CALL counts matrix and experiment that later
#gets processed into the GISTIC CNV GENE-BASED PEAK Experiment WHICH WE HAVE ALREADY FROM miniACC. THEREFORE, WE OBTAIN THIS INDIVIDUAL CNV CALL
#EXPERIMENT FROM TCGA AND ADD IT TO ORIGINAL MULTIASSAY EXPERIMENT OBJECT AS FOLLOWS:
miniACC.assays.comp.age.cnvcalls<-miniACC.assays.comp.age
cnv <- curatedTCGAData(diseaseCode = "ACC",assays = c("*CNV*"), version="1.1.38",dry.run = FALSE)
## Querying and downloading: ACC_CNVSNP-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## Querying and downloading: ACC_colData-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## Querying and downloading: ACC_metadata-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## Querying and downloading: ACC_sampleMap-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## harmonizing input:
## removing 825 sampleMap rows not in names(experiments)
test<-cnv[[1]]
#The c function allows the user to concatenate an additional experiment to an existing MultiAssayExperiment.
#The optional sampleMap argument allows concatenating an assay whose column names do not match the row names of colData.
#For convenience, the mapFrom argument (mapFrom=1L) allows the user to map from a particular experiment provided that the order of the colnames is the same.
#A warning will be issued to make the user aware of this assumption.
miniACC.assays.comp.age.cnvcalls<-c(miniACC.assays.comp.age.cnvcalls, newassay=cnv)
## Warning in `[<-.factor`(`*tmp*`, ri, value = c(58L, 44L, 23L, 23L, 30L, :
## invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, ri, value = c(58L, 44L, 23L, 23L, 30L, :
## invalid factor level, NA generated
#To annotate the genomic coordinates of the genes measured in the RNA-seq assay, we use the function symbolsToRanges from the TCGAutils package.
#In the cases where row annotations indicate gene symbols, the symbolsToRanges utility function converts genes to genomic ranges and replaces existing assays
#with RangedSummarizedExperiment objects. Gene annotations are given as 'hg19' genomic regions.
#Name of the genome is typically the name of an NCBI assembly (e.g. GRCh38.p13, WBcel235, TAIR10.1, ARS-UCD1.2, etc...)
#or UCSC genome(e.g. hg38, bosTau9, galGal6, ce11, etc...)
miniACC.assays.comp.age.cnvcalls.ranges <- TCGAutils::symbolsToRanges(miniACC.assays.comp.age.cnvcalls, unmapped=FALSE)
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 20 sampleMap rows not in names(experiments)
#microRNA assays obtained from curatedTCGAData have annotated sequences that can be converted to genomic ranges using the mirbase.db package.
#The function looks up all sequences and converts them to ('hg19') ranges. For those rows that cannot be found, an 'unranged' assay is introduced in the resulting MultiAssayExperiment object.
miniACC.assays.comp.age.cnvcalls.ranges <- mirToRanges(miniACC.assays.comp.age.cnvcalls.ranges)
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## harmonizing input:
## removing 10 sampleMap rows not in names(experiments)
#for(i in 1:4)
#{
# rr <- rowRanges(miniACC.assays.comp.age.cnvcalls.ranges[[i]])
# GenomeInfoDb::genome(rr) <- "hg19"
# GenomeInfoDb::seqlevelsStyle(rr) <- "UCSC"
# ind <- as.character(seqnames(rr)) %in% c("chr1","chr2","chr3", "chr4","chr5", "chr6","chr7", "chr8", "chr9","chr10","chr11", "chr12","chr13", "chr14","chr15", "chr16", "chr17", "chr18","chr19","chr20", "chr21","chr22", "chr23", "chrx")
# rowRanges(miniACC.assays.comp.age.cnvcalls.ranges[[i]]) <- rr
# miniACC.assays.comp.age.cnvcalls.ranges[[i]] <- miniACC.assays.comp.age.cnvcalls.ranges[[i]][ind,]
#}
#miniACC.assays.comp.age.cnvcalls.ranges
#We now restrict the analysis to intersecting patients of the three assays using MultiAssayExperiment’s intersectColumns function,
#and select Primary Solid Tumor samples using the splitAssays function from the TCGAutils package.
#miniACC.assays.comp.age.cnvcalls <- MultiAssayExperiment::intersectColumns(miniACC.assays.comp.age.cnvcalls)
#miniACC.assays.comp.age.cnvcalls<-miniACC.assays.comp.age.cnvcalls[, miniACC.assays.comp.age.cnvcalls$patientID %in%
#c("TCGA-OR-A5J9", "TCGA-OR-A5JE", "TCGA-OR-A5JF", "TCGA-OR-A5JI", "TCGA-OR-A5K0" ,"TCGA-OR-A5KV", "TCGA-OR-A5L5", "TCGA-OR-A5LC", "TCGA-OR-A5LE","TCGA-OR-A5LL" ) , ]
#miniACC.assays.comp.age.cnvcalls.ranges <- splitAssays(miniACC.assays.comp.age.cnvcalls.ranges, sampleCodes="01")
#Error: 'splitAssays' is not an exported object from 'namespace:TCGAutils'
#miniACC.assays.comp.age.cnvcalls.ranges <- splitAssays(miniACC.assays.comp.age.cnvcalls.ranges, c("01"))
#Error in splitAssays(miniACC.assays.comp.age.cnvcalls.ranges, c("01")) :
#is.list(hitList) || is(hitList, "List") is not TRUE
#Extracting individual summarized experiments which will be henceforth individually analyzed:
cnv_calls<-miniACC.assays.comp.age.cnvcalls.ranges[[1]]
cnv_calls
## class: RaggedExperiment
## dim: 21052 180
## assays(2): Num_Probes Segment_Mean
## rownames: NULL
## colnames(180): TCGA-OR-A5J1-01A-11D-A29H-01
## TCGA-OR-A5J1-10A-01D-A29K-01 ... TCGA-PK-A5HC-01A-11D-A309-01
## TCGA-PK-A5HC-11A-11D-A309-01
## colData names(0):
#head(assays(cnv_calls)$Num_Probes)
#head(assays(cnv_calls)$Segment_Mean)
mRNA_expr<-miniACC.assays.comp.age.cnvcalls.ranges[[2]]
mRNA_expr
## class: RangedSummarizedExperiment
## dim: 195 10
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(195): DIRAS3 MAPK14 ... SQSTM1 KCNJ13
## rowData names(1): gene_id
## colnames(10): TCGA-OR-A5J9-01A-11R-A29S-07 TCGA-OR-A5JE-01A-11R-A29S-07
## ... TCGA-OR-A5LE-01A-11R-A29S-07 TCGA-OR-A5LL-01A-11R-A29S-07
## colData names(0):
cnv_gistic<-miniACC.assays.comp.age.cnvcalls.ranges[[3]]
cnv_gistic
## class: RangedSummarizedExperiment
## dim: 195 10
## metadata(0):
## assays(1): ''
## rownames(195): DIRAS3 MAPK14 ... SQSTM1 KCNJ13
## rowData names(4): Gene.Symbol Locus.ID Cytoband gene_id
## colnames(10): TCGA-OR-A5J9-01A-11D-A29H-01 TCGA-OR-A5JE-01A-11D-A29H-01
## ... TCGA-OR-A5LE-01A-11D-A29H-01 TCGA-OR-A5LL-01A-11D-A29H-01
## colData names(0):
miRNA_expr<-miniACC.assays.comp.age.cnvcalls.ranges[[4]]
miRNA_expr
## class: RangedSummarizedExperiment
## dim: 448 10
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(448): hsa-let-7a-1 hsa-let-7a-2 ... hsa-mir-99a hsa-mir-99b
## rowData names(1): mirna_id
## colnames(10): TCGA-OR-A5J9-01A-11R-A29W-13 TCGA-OR-A5JE-01A-11R-A29W-13
## ... TCGA-OR-A5LE-01A-11R-A29W-13 TCGA-OR-A5LL-01A-11R-A29W-13
## colData names(0):
miRNA_expr_unranged<-miniACC.assays.comp.age.cnvcalls.ranges[[5]]
miRNA_expr_unranged
## class: SummarizedExperiment
## dim: 23 10
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(23): hsa-mir-103-1 hsa-mir-103-2 ... hsa-mir-663 hsa-mir-664
## rowData names(0):
## colnames(10): TCGA-OR-A5J9-01A-11R-A29W-13 TCGA-OR-A5JE-01A-11R-A29W-13
## ... TCGA-OR-A5LE-01A-11R-A29W-13 TCGA-OR-A5LL-01A-11R-A29W-13
## colData names(0):
#We will henceforth analyze the individual summarized experiments extracted from the MULTIASSAY EXPERIMENT miniACC.assays.comp.age
#(1) PRIOR TO INCLUSION of the additional summarized experiment for INDIVIDUAL CNV_CALLS because we were unsuccessful in equalizing the samples and patients
#across all summarized experiments including the individual calls experiment, and
#(2) PRIOR TO EXECUTION OF THE mirToRanges function because this unfortunately segregated the miRNA Summarized Experiment into Ranged and Unranged Experiments
mRNA-Seq DATA BLOCK ANALYSIS
#Preliminary analysis of individual extracted mRNA-seq Summarized Experiment:
#Creating a phenotype dataframe for mRNA expression:
phenoN <- data.frame(sample=colnames(mACC.exp.c3),patientID=colData(miniACC.assays.comp.age)$patientID, age.status=colData(miniACC.assays.comp.age)$years_to_birth)
rownames(phenoN)<-phenoN$sample
countsM <- as.matrix(assays(mACC.exp3)$exprs)
#These are identical matrices
#The GENE IDs appear to be HGNC symbols. For instance, DIRAS3 is the HGNC symbol for the human gene DIRAS family GTPase 3 according to the website: https://www.ncbi.nlm.nih.gov/gene/9077
sum(is.na(countsM))
## [1] 0
#As part of the exploration, we plot data
boxplot(countsM) #They didn't apply log2 on the TMM for transformation
boxplot(log2(countsM+2))
#Check Library size
lSize <- colSums(countsM)
lSize #all sample sums < 1M and non-homogeneous (a column sum of exactly 1M is a property of TPM scaling, not of TMM normalization)
## TCGA-OR-A5J9-01A-11R-A29S-07 TCGA-OR-A5JE-01A-11R-A29S-07
## 533661.8 698097.1
## TCGA-OR-A5JF-01A-11R-A29S-07 TCGA-OR-A5JI-01A-11R-A29S-07
## 555528.0 648939.3
## TCGA-OR-A5K0-01A-11R-A29S-07 TCGA-OR-A5KV-01A-11R-A29S-07
## 562191.3 727214.2
## TCGA-OR-A5L5-01A-11R-A29S-07 TCGA-OR-A5LC-01A-11R-A29S-07
## 642192.1 560695.8
## TCGA-OR-A5LE-01A-11R-A29S-07 TCGA-OR-A5LL-01A-11R-A29S-07
## 592894.2 511599.7
#We study total of reads per sample (library size).
sampleT <- apply(countsM, 2, sum)/10^6
range(sampleT)
## [1] 0.5115997 0.7272142
sampleTDF <- data.frame(sample=names(sampleT), total=sampleT)
p <- ggplot(aes(x=sample, y=sampleT, fill=sampleT), data=sampleTDF) + geom_bar(stat="identity")
p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ylab("")
#One of the characteristics of RNA-seq data is that it contains a lot of zeros,
#corresponding to genes that are not expressed. It is therefore important to remove genes that
#consistently have zero or very low counts. In this case we will only keep genes that have
#more than 10 reads in at least 5 samples. One recommendation is to set the number of samples
#to the smallest group size. Our "old" group and "young" group have 5 samples each (10 patients, 10 samples total)
keep <- rowSums(countsM > 10) >= 5 # at least 5 samples have more than 10 reads for the gene
countsF <- countsM[keep,]
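#A minimal sketch of the filtering rule above on a made-up count matrix (all values and the names toy, min_reads and min_samples are hypothetical).
#It illustrates that the comparison is strict: a gene with exactly 10 reads in every sample does not pass "> 10".

```r
# Toy count matrix: 4 genes x 6 samples (values are made up)
toy <- matrix(c(0,  0,  3,  0,  1,  0,    # geneA: almost never expressed
                50, 60, 8,  70, 90, 55,   # geneB: > 10 reads in 5 samples
                11, 12, 2,  1,  0,  3,    # geneC: > 10 reads in 2 samples
                10, 10, 10, 10, 10, 10),  # geneD: exactly 10, so never > 10
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("gene", LETTERS[1:4]), paste0("s", 1:6)))

min_reads   <- 10  # strict threshold, as in countsM > 10
min_samples <- 3   # smallest group size in this toy design

# Count, per gene, the samples above the threshold and keep genes with enough of them
keep_toy <- rowSums(toy > min_reads) >= min_samples
toy_filtered <- toy[keep_toy, , drop = FALSE]
rownames(toy_filtered)
```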
#There are several methods that can be used to normalize values in count matrices.
#Traditionally, CPM (Counts Per Million), RPKM (Reads Per Kilobase Million) or FPKM (Fragments Per Kilobase Million)
#were used to report RNA-seq results. However, TPM (Transcripts Per Million) is now more popular.
#CPM divides the counts by library size whereas RPKM/FPKM and TPM scale the data using gene length and library size.
#When comparing samples, TMM (Trimmed Mean of M-values) is the standard method to report results. Other methods
#also include the GC content in the normalization step.
#Gene length
#To normalize using RPKM, FPKM or TPM we will need the gene length.
#Let’s obtain this information through biomaRt.
#The Cancer Genome Atlas (TCGA) uses GENCODE 36 (GRCh38/hg38) as a reference gene model
#The GENCODE annotation is made by merging the manual gene annotation produced by the Ensembl-Havana team and the Ensembl-genebuild automated gene annotation.
#GENCODE version 36 corresponds to Ensembl 102 based on the website:https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=wgEncodeGencodeSuper
#We will take version 102 from the available archives. We will extract the HGNC symbol, chromosome, start and end to compute the gene length later.
listMarts()
## biomart version
## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 112
## 2 ENSEMBL_MART_MOUSE Mouse strains 112
## 3 ENSEMBL_MART_SNP Ensembl Variation 112
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 112
listEnsemblArchives()
## name date url version
## 1 Ensembl GRCh37 Feb 2014 https://grch37.ensembl.org GRCh37
## 2 Ensembl 112 May 2024 https://may2024.archive.ensembl.org 112
## 3 Ensembl 111 Jan 2024 https://jan2024.archive.ensembl.org 111
## 4 Ensembl 110 Jul 2023 https://jul2023.archive.ensembl.org 110
## 5 Ensembl 109 Feb 2023 https://feb2023.archive.ensembl.org 109
## 6 Ensembl 108 Oct 2022 https://oct2022.archive.ensembl.org 108
## 7 Ensembl 107 Jul 2022 https://jul2022.archive.ensembl.org 107
## 8 Ensembl 106 Apr 2022 https://apr2022.archive.ensembl.org 106
## 9 Ensembl 105 Dec 2021 https://dec2021.archive.ensembl.org 105
## 10 Ensembl 104 May 2021 https://may2021.archive.ensembl.org 104
## 11 Ensembl 103 Feb 2021 https://feb2021.archive.ensembl.org 103
## 12 Ensembl 102 Nov 2020 https://nov2020.archive.ensembl.org 102
## 13 Ensembl 101 Aug 2020 https://aug2020.archive.ensembl.org 101
## 14 Ensembl 100 Apr 2020 https://apr2020.archive.ensembl.org 100
## 15 Ensembl 99 Jan 2020 https://jan2020.archive.ensembl.org 99
## 16 Ensembl 98 Sep 2019 https://sep2019.archive.ensembl.org 98
## 17 Ensembl 97 Jul 2019 https://jul2019.archive.ensembl.org 97
## 18 Ensembl 80 May 2015 https://may2015.archive.ensembl.org 80
## 19 Ensembl 77 Oct 2014 https://oct2014.archive.ensembl.org 77
## 20 Ensembl 75 Feb 2014 https://feb2014.archive.ensembl.org 75
## 21 Ensembl 54 May 2009 https://may2009.archive.ensembl.org 54
## current_release
## 1
## 2 *
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## 17
## 18
## 19
## 20
## 21
#Taking version 102
listEnsembl(version = 102)
## biomart version
## 1 genes Ensembl Genes 102
## 2 mouse_strains Mouse strains 102
## 3 snps Ensembl Variation 102
## 4 regulation Ensembl Regulation 102
ensembl102 <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl',version = 102)
#listDatasets(ensembl102)
attributes = listAttributes(ensembl102)
attributes[1:5,]
## name description page
## 1 ensembl_gene_id Gene stable ID feature_page
## 2 ensembl_gene_id_version Gene stable ID version feature_page
## 3 ensembl_transcript_id Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5 ensembl_peptide_id Protein stable ID feature_page
#searchAttributes(mart = ensembl102, pattern = "hgnc_symbol")
#searchAttributes(mart = ensembl102, pattern = "position")
#searchAttributes(mart = ensembl102, pattern = "length")
#searchAttributes(mart = ensembl102, pattern = "ensembl.*id")
searchAttributes(mart = ensembl102, pattern = "entrez.*id")
## name description page
## 79 entrezgene_id NCBI gene (formerly Entrezgene) ID feature_page
filters = listFilters(ensembl102)
filters[1:5,]
## name description
## 1 chromosome_name Chromosome/scaffold name
## 2 start Start
## 3 end End
## 4 band_start Band Start
## 5 band_end Band End
searchFilters(mart = ensembl102, pattern = "hgnc_symbol")
## name description
## 81 hgnc_symbol HGNC symbol(s) [e.g. A1BG]
head(searchFilters(mart = ensembl102, pattern = "ensembl.*id"))
## name
## 56 ensembl_gene_id
## 57 ensembl_gene_id_version
## 58 ensembl_transcript_id
## 59 ensembl_transcript_id_version
## 60 ensembl_peptide_id
## 61 ensembl_peptide_id_version
## description
## 56 Gene stable ID(s) [e.g. ENSG00000000003]
## 57 Gene stable ID(s) with version [e.g. ENSG00000000003.15]
## 58 Transcript stable ID(s) [e.g. ENST00000000233]
## 59 Transcript stable ID(s) with version [e.g. ENST00000000233.10]
## 60 Protein stable ID(s) [e.g. ENSP00000000233]
## 61 Protein stable ID(s) with version [e.g. ENSP00000000233.5]
gensInfo<-getBM(attributes=c("hgnc_symbol","ensembl_gene_id","chromosome_name","start_position","end_position","entrezgene_id","hgnc_symbol","description" ), filters=c("hgnc_symbol"), values=list(rownames(countsF)), mart=ensembl102)
gensInfo$length <- gensInfo$end_position - gensInfo$start_position
range(gensInfo$length)
## [1] 2403 824272
dim(gensInfo) #notice different length of genes, there are some repetitions and some missing values
## [1] 195 9
table(duplicated(gensInfo$hgnc_symbol)) #14 rows repeat a symbol that was already seen
##
## FALSE TRUE
## 181 14
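gensInfo[duplicated(gensInfo$hgnc_symbol),]#the duplicated symbols are genes with several Ensembl gene IDs, some on alternate-haplotype scaffolds (e.g. HSPA1A, MAPT), not miRNAs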
## hgnc_symbol ensembl_gene_id chromosome_name start_position
## 2 ACACA ENSG00000278540 17 37084992
## 10 AKT3 ENSG00000117020 1 243488233
## 48 CLDN7 ENSG00000181885 17 7259903
## 58 EEF2K ENSG00000103319 16 22206278
## 80 HSPA1A ENSG00000234475 CHR_HSCHR6_MHC_DBB_CTG1 31797650
## 81 HSPA1A ENSG00000237724 CHR_HSCHR6_MHC_COX_CTG1 31802834
## 82 HSPA1A ENSG00000215328 CHR_HSCHR6_MHC_QBL_CTG1 31805699
## 83 HSPA1A ENSG00000204389 6 31815543
## 103 MAPT ENSG00000276155 CHR_HSCHR17_1_CTG5 46069784
## 104 MAPT ENSG00000186868 17 45894551
## 111 MYH11 ENSG00000133392 16 15703135
## 141 PTEN ENSG00000171862 10 87863625
## 156 RPS6KA1 ENSG00000117676 1 26529761
## 194 YWHAE ENSG00000108953 17 1344275
## end_position entrezgene_id hgnc_symbol.1
## 2 37406836 31 ACACA
## 10 243851079 10000 AKT3
## 48 7263983 1366 CLDN7
## 58 22288738 29904 EEF2K
## 80 31800132 3303 HSPA1A
## 81 31805316 3303 HSPA1A
## 82 31808181 3303 HSPA1A
## 83 31817946 3303 HSPA1A
## 103 46203150 4137 MAPT
## 104 46028334 4137 MAPT
## 111 15857028 4629 MYH11
## 141 87971930 5728 PTEN
## 156 26575030 6195 RPS6KA1
## 194 1400222 7531 YWHAE
## description
## 2 acetyl-CoA carboxylase alpha [Source:HGNC Symbol;Acc:HGNC:84]
## 10 AKT serine/threonine kinase 3 [Source:HGNC Symbol;Acc:HGNC:393]
## 48 claudin 7 [Source:HGNC Symbol;Acc:HGNC:2049]
## 58 eukaryotic elongation factor 2 kinase [Source:HGNC Symbol;Acc:HGNC:24615]
## 80 heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 81 heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 82 heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 83 heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 103 microtubule associated protein tau [Source:HGNC Symbol;Acc:HGNC:6893]
## 104 microtubule associated protein tau [Source:HGNC Symbol;Acc:HGNC:6893]
## 111 myosin heavy chain 11 [Source:HGNC Symbol;Acc:HGNC:7569]
## 141 phosphatase and tensin homolog [Source:HGNC Symbol;Acc:HGNC:9588]
## 156 ribosomal protein S6 kinase A1 [Source:HGNC Symbol;Acc:HGNC:10430]
## 194 tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein epsilon [Source:HGNC Symbol;Acc:HGNC:12851]
## length
## 2 321844
## 10 362846
## 48 4080
## 58 82460
## 80 2482
## 81 2482
## 82 2482
## 83 2403
## 103 133366
## 104 133783
## 111 153893
## 141 108305
## 156 45269
## 194 55947
length(setdiff(rownames(countsF), gensInfo$hgnc_symbol))
## [1] 1
countsFDF <- data.frame(ID=rownames(countsF),countsF)
countsFInfo <- right_join(countsFDF, gensInfo, by=c("ID"="hgnc_symbol"))
countsFInfo <- countsFInfo[!duplicated(countsFInfo$ID),] #After having checked duplications, just keep first result
countsFInfo_backup<-countsFInfo
colnames(countsFInfo_backup)[colnames(countsFInfo_backup) == 'chromosome_name'] <- 'chr'
colnames(countsFInfo_backup)[colnames(countsFInfo_backup) == 'start_position'] <- 'start'
colnames(countsFInfo_backup)[colnames(countsFInfo_backup) == 'end_position'] <- 'end'
#Chromosome names that are missing or erroneous need to be fixed:
countsFInfo_backup[countsFInfo_backup$ID == "RPS6KA1", "chr"] <- "1"
countsFInfo_backup[countsFInfo_backup$ID == "AKT3", "chr"] <- "1"
countsFInfo_backup[countsFInfo_backup$ID == "CLDN7", "chr"] <- "17"
countsFInfo_backup[countsFInfo_backup$ID == "PTEN", "chr"] <- "10"
countsFInfo_backup[countsFInfo_backup$ID == "YWHAE", "chr"] <- "17"
countsFInfo_backup[countsFInfo_backup$ID == "MAPT", "chr"] <- "17"
countsFInfo_backup[countsFInfo_backup$ID == "ACACA", "chr"] <- "17"
countsFInfo_backup[countsFInfo_backup$ID == "EEF2K", "chr"] <- "16"
countsFInfo_backup[countsFInfo_backup$ID == "MYH11", "chr"] <- "16"
countsFInfo_backup[countsFInfo_backup$ID == "HSPA1A", "chr"] <- "6"
countsFInfo_backup[countsFInfo_backup$ID == "CHGA", "chr"] <- "14"
countsFInfo_backup$chr<-paste0("chr", countsFInfo_backup$chr )
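#A hedged alternative to part of the manual chromosome fixes above, shown on a made-up annotation table (gene names, coordinates and the object names ann, primary,
#ann_sorted and ann_unique are hypothetical). The idea is to sort primary-assembly rows first so that !duplicated() keeps them instead of a scaffold row.
#It does not cover symbols that are missing from the annotation altogether; those would still need a manual entry.

```r
# Toy annotation table mimicking duplicated symbols (coordinates are made up)
ann <- data.frame(
  hgnc_symbol     = c("GENE1", "GENE1", "GENE2", "GENE3", "GENE3"),
  chromosome_name = c("CHR_HSCHR6_MHC_DBB_CTG1", "6", "17", "CHR_HSCHR17_1_CTG5", "17"),
  start_position  = c(100, 120, 500, 900, 950),
  stringsAsFactors = FALSE
)

# Names of the primary-assembly chromosomes in Ensembl style
primary <- c(as.character(1:22), "X", "Y", "MT")

# Put primary-assembly rows first, then keep the first row per symbol
is_primary <- ann$chromosome_name %in% primary
ann_sorted <- ann[order(!is_primary), ]
ann_unique <- ann_sorted[!duplicated(ann_sorted$hgnc_symbol), ]

# Add the UCSC-style "chr" prefix, as done in the analysis
ann_unique$chr <- paste0("chr", ann_unique$chromosome_name)
```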
#To perform FPKM (for paired-end reads) or RPKM (for single-end reads), we first divide by the library size and then by gene length.
#After FPKM normalization the column sums differ between samples.
#step 1: normalize for read depth and multiply by million
readD <- apply(countsFInfo[,2:11], 2, function(x) x / sum(x) * 10^6)
#step 2. scale by gene length and multiply by thousand
countsFPKM <- readD / countsFInfo$length * 10^3
colSums(countsFPKM)
## TCGA.OR.A5J9.01A.11R.A29S.07 TCGA.OR.A5JE.01A.11R.A29S.07
## 95486.10 134318.90
## TCGA.OR.A5JF.01A.11R.A29S.07 TCGA.OR.A5JI.01A.11R.A29S.07
## 101615.03 123874.63
## TCGA.OR.A5K0.01A.11R.A29S.07 TCGA.OR.A5KV.01A.11R.A29S.07
## 111547.13 131227.77
## TCGA.OR.A5L5.01A.11R.A29S.07 TCGA.OR.A5LC.01A.11R.A29S.07
## 118024.86 117143.21
## TCGA.OR.A5LE.01A.11R.A29S.07 TCGA.OR.A5LL.01A.11R.A29S.07
## 113831.66 74559.88
#To perform TPM, we first divide by the gene length and then we divide by the sum of the length-scaled values (the transformed sequencing depth).
#Check that the sum of each column after TPM normalization equals 10^6.
# sampleTF <- colSums(countsFInfo[,2:11])
#step 1: divide by gene length and multiply by thousand to obtain the reads per kilobase (RPK)
rpk <- countsFInfo[,2:11] / countsFInfo$length * 10^3
#step 2: divide by sequencing depth and multiply by million
countsTPM <- apply(rpk, 2, function(x) x / sum(x) * 10^6)
#check totals (All equal to 1 million)
colSums(countsTPM)
## TCGA.OR.A5J9.01A.11R.A29S.07 TCGA.OR.A5JE.01A.11R.A29S.07
## 1e+06 1e+06
## TCGA.OR.A5JF.01A.11R.A29S.07 TCGA.OR.A5JI.01A.11R.A29S.07
## 1e+06 1e+06
## TCGA.OR.A5K0.01A.11R.A29S.07 TCGA.OR.A5KV.01A.11R.A29S.07
## 1e+06 1e+06
## TCGA.OR.A5L5.01A.11R.A29S.07 TCGA.OR.A5LC.01A.11R.A29S.07
## 1e+06 1e+06
## TCGA.OR.A5LE.01A.11R.A29S.07 TCGA.OR.A5LL.01A.11R.A29S.07
## 1e+06 1e+06
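#A minimal sketch contrasting the two normalizations above on made-up counts and gene lengths (all values and the function names fpkm and tpm are hypothetical).
#The only difference is the order of the two scaling steps, which is why TPM columns sum to one million while FPKM columns generally do not.

```r
# Toy counts (genes x samples) and gene lengths in bp; all values are made up
counts_toy <- matrix(c(100, 200,
                       300, 100,
                       600, 700),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
len_bp <- c(g1 = 1000, g2 = 2000, g3 = 4000)

# FPKM/RPKM: scale by library size first, then by gene length
fpkm <- function(counts, len) {
  per_million <- sweep(counts, 2, colSums(counts), "/") * 1e6
  per_million / len * 1e3
}

# TPM: scale by gene length first, then by the sum of the length-scaled values
tpm <- function(counts, len) {
  rpk <- counts / len * 1e3
  sweep(rpk, 2, colSums(rpk), "/") * 1e6
}

fpkm_toy <- fpkm(counts_toy, len_bp)
tpm_toy  <- tpm(counts_toy, len_bp)

colSums(tpm_toy)   # each column sums to one million
colSums(fpkm_toy)  # column sums differ between samples

# TPM can also be obtained by rescaling FPKM within each sample
tpm_from_fpkm <- sweep(fpkm_toy, 2, colSums(fpkm_toy), "/") * 1e6
```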
#PREPARING DATAFRAME FOR LATER CNV VS. mRNA-Seq CORRELATION ANALYSIS AND MFA
countsF_TPM_LOG<-log2(countsTPM[,1:10]+2)
countsF_TPM_LOG_DF<-as.data.frame(countsF_TPM_LOG)
countsF_TPM_LOG_DF$ID<-countsFInfo_backup$ID
countsF_TPM_LOG_DF$chr<-countsFInfo_backup$chr
countsF_TPM_LOG_DF$start<-countsFInfo_backup$start
countsF_TPM_LOG_DF$end<-countsFInfo_backup$end
#PCA for mRNA-Seq
countsF_TPM_LOG_DF_PCAMFA<-countsF_TPM_LOG_DF[,1:10]
#Transpose
countsF_TPM_LOG_DF_PCAMFA.t<-t(countsF_TPM_LOG_DF_PCAMFA)
#Assign names; we include an exp suffix to differentiate gene expression variables from CNV variables
colnames(countsF_TPM_LOG_DF_PCAMFA.t)<-paste(countsF_TPM_LOG_DF$ID,"exp",sep=".")
#Construct data.frame to perform PCA
expr4pca<-data.frame(cond2,countsF_TPM_LOG_DF_PCAMFA.t)
res.pca.expr<-PCA(expr4pca,quali.sup=1)
res.pca.expr
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 10 individuals, described by 182 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$quali.sup" "results for the supplementary categorical variables"
## 12 "$quali.sup$coord" "coord. for the supplementary categories"
## 13 "$quali.sup$v.test" "v-test of the supplementary categories"
## 14 "$call" "summary statistics"
## 15 "$call$centre" "mean of the variables"
## 16 "$call$ecart.type" "standard error of the variables"
## 17 "$call$row.w" "weights for the individuals"
## 18 "$call$col.w" "weights for the variables"
plot(res.pca.expr,habillage=1)
#We observe differences between the young and old patient samples (in dim 1 and dim2)
#FPKM and TPM account for gene length and library size per sample but do not take into account the rest of the samples
#belonging to the experiment. There are situations in which some genes can accumulate high rates of reads.
#To correct for this imbalance in the composition of the counts there are methods such as the Trimmed Mean of M-values (TMM),
#included in the package edgeR. This normalization is suitable for comparing among samples, for instance when performing sample
#aggregations.
#Normalization using TMM (edgeR package)
d <- DGEList(counts = countsF)
Norm.Factor <- calcNormFactors(d, method = "TMM")
countsTMM <- cpm(Norm.Factor, log = TRUE)
countsTMMnoLog <- cpm(Norm.Factor, log = FALSE)
#Observing how distribution of the three normalizations (in log2) change (for the first sample).
hist(log2(countsFPKM[,1]+2), xlab="log2-ratio", main="FPKM")
#Appears to be a normal distribution of log2-ratios
hist(log2(countsTPM[,1]+2), xlab="log2-ratio", main="TPM")
#Appears to be a normal distribution of log2-ratios
#We will later need the gene IDs attached to the filtered, TPM-normalized, log-transformed mRNA-seq counts matrix
#for the future mRNA-seq vs. GISTIC CNV correlation analysis.
hist(countsTMM[,1], xlab="log2-ratio", main="TMM")
#Appears to be a normal distribution of log2-ratios
#To see how samples aggregate, we will perform hierarchical clustering as well as PCA.
#The purpose is to see whether samples aggregate by condition or whether there are outliers that might have biological or technical causes.
#Hierarchical clustering
x_rna<-countsTMM
#Euclidean distance
clust.cor.ward <- hclust(dist(t(x_rna)),method="ward.D2")
plot(clust.cor.ward, main="hierarchical clustering", hang=-1,cex=0.8)
#The ward.D2 hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients
clust.cor.average <- hclust(dist(t(x_rna)),method="average")
plot(clust.cor.average, main="hierarchical clustering", hang=-1,cex=0.8)
#The average hierarchical clustering DOES NOT appear to reflect the segregation of 5 old and 5 young patients
clust.cor.complete <- hclust(dist(t(x_rna)),method="complete")
plot(clust.cor.complete, main="hierarchical clustering", hang=-1,cex=0.8)
#The complete hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients
#Correlation based distance
clust.cor.ward <- hclust(as.dist(1-cor(x_rna)),method="ward.D2")
plot(clust.cor.ward, main="hierarchical clustering", hang=-1,cex=0.8)
#The ward.D2 hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients
clust.cor.average<- hclust(as.dist(1-cor(x_rna)),method="average")
plot(clust.cor.average, main="hierarchical clustering", hang=-1,cex=0.8)
#The average hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients
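#The dendrograms above are judged by eye. A minimal sketch of a more explicit check, on simulated data (the matrix x_toy, the group labels and the size of
#the simulated shift are assumptions): cut each tree into two clusters with cutree() and cross-tabulate cluster membership against the known groups.

```r
set.seed(1)
# Toy log-expression matrix: 50 genes x 6 samples, with a group shift in 10 genes
group <- factor(rep(c("old", "young"), each = 3))
x_toy <- matrix(rnorm(50 * 6), nrow = 50,
                dimnames = list(paste0("g", 1:50), paste0("s", 1:6)))
x_toy[1:10, group == "old"] <- x_toy[1:10, group == "old"] + 6

# Euclidean distance between samples, Ward linkage
hc_euc <- hclust(dist(t(x_toy)), method = "ward.D2")
# Correlation-based distance between samples, average linkage
hc_cor <- hclust(as.dist(1 - cor(x_toy)), method = "average")

# Cut each tree into two clusters and cross-tabulate against the known groups
tab_euc <- table(cluster = cutree(hc_euc, k = 2), group = group)
tab_cor <- table(cluster = cutree(hc_cor, k = 2), group = group)
tab_euc
tab_cor
```

#A table in which each cluster row contains samples from only one group corresponds to a dendrogram that segregates old from young patients.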
#Data Preparation
cond2<-phenoN$age.status
countsF_backup<-as.matrix(countsF)
sum1<-sum(is.na(countsF_backup))
sum1
## [1] 0
#Density plot of raw read counts (log10)
countsSF_backup_log <- log(countsF_backup,10)
d <- density(countsSF_backup_log)
plot(d,xlim=c(1,8),main="",ylim=c(0,.45),xlab="Raw filtered read counts per gene (log10)", ylab="Density")
#Overlay one density curve per sample
for (s in seq_len(ncol(countsF_backup))){
sample_log <- log(countsF_backup[,s],10)
d <- density(sample_log)
lines(d)
}
#Box plots of raw filtered read counts after log10 transformation
countsSF_backup_log <- log(countsF_backup,10)
boxplot(countsSF_backup_log , main="", xlab="", ylab="Raw read counts per gene (log10)",axes=FALSE)
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 8 is not drawn
axis(2)
axis(1,at=c(1:length(colnames(countsSF_backup_log))),labels=colnames(countsSF_backup_log),las=2,cex.axis=0.8)
#Plot Heatmap with condition age.status as labels
colnames(countsF_backup)<-phenoN$age.status
heatmap(countsF_backup, col = topo.colors(50), margin=c(10,6))
#Heatmap reveals that Old patients were relatively underexpressing more mRNA genes
# PCA
#library ggfortify needed for the autoplot to understand and plot PCA results
summary(pca.filt <- prcomp(t(x_rna), scale.=TRUE ))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 6.7546 5.9800 4.8424 4.4924 4.15924 3.62842 3.47849
## Proportion of Variance 0.2507 0.1965 0.1288 0.1109 0.09505 0.07234 0.06648
## Cumulative Proportion 0.2507 0.4472 0.5760 0.6869 0.78196 0.85429 0.92078
## PC8 PC9 PC10
## Standard deviation 3.0315 2.28660 1.072e-14
## Proportion of Variance 0.0505 0.02873 0.000e+00
## Cumulative Proportion 0.9713 1.00000 1.000e+00
autoplot(pca.filt, data=phenoN, colour="patientID", shape="age.status")
#There does not appear to be segregation by age status
#Note that the first 2 principal components, PC1 and PC2, together account for
#25.07% + 19.65% = 44.72% of the total variance
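#A minimal sketch of where the "Proportion of Variance" row comes from, on a simulated matrix (m_toy and pca_toy are hypothetical names).
#Each proportion is the squared standard deviation of a component divided by the total variance. With 182 scaled genes the total variance is 182,
#so the PC1 value reported above can be checked by hand as 6.7546^2 / 182.

```r
set.seed(2)
# Toy matrix: 8 samples (rows) x 20 features (columns)
m_toy <- matrix(rnorm(8 * 20), nrow = 8)
pca_toy <- prcomp(m_toy, scale. = TRUE)

# Proportion of variance per component = squared sdev over total variance
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cum_explained <- cumsum(var_explained)

# The same numbers (rounded) appear in the "Proportion of Variance" row of summary()
from_summary <- summary(pca_toy)$importance["Proportion of Variance", ]

# Hand check of the PC1 proportion reported in the analysis (182 scaled genes)
pc1_check <- 6.7546^2 / 182
```

#After centering, n samples span at most n - 1 dimensions, which is why the last component (PC10 above, PC8 in this sketch) carries essentially no variance.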
#RNA-seq data are represented by count matrices and therefore linear models like those implemented in limma
#cannot be directly applied. There are several options we can take:
#1.Transform counts matrices and apply limma
#2.Use specific methods that account for count data distribution
#The voom transformation is used for the first (limma) approach, and DESeq2, which accounts
#for the Negative Binomial distribution of the data, is used for the second approach.
#The limma approach for RNA-seq converts read counts to log2-counts-per-million (logCPM) and the mean-variance relationship
#is modeled either with precision weights (the voom approach) or with an empirical Bayes prior trend (the limma-trend approach).
#Voom estimates the mean-variance relationship of the log-counts and creates weights that are later on used by limma.
#Applying the voom transformation and the limma model to detect differentially expressed genes using the variable cond2.
cond2<-phenoN$age.status
design <- model.matrix(~0+cond2)
rownames(design) <- phenoN$sample
colnames(design) <- gsub("cond2", "", colnames(design))
voom.res <- voom(countsF, design, plot = T)
#Model fit
fit <- lmFit(voom.res, design)
#contrasts
contrast.matrix <- makeContrasts(con1=old-young,levels = design)
#contrasts fit and Bayesian adjustment
fit2 <- contrasts.fit(fit, contrast.matrix)
fite <- eBayes(fit2)
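#A minimal sketch of what the no-intercept design and the old-young contrast compute for a single gene, using base R only and made-up values
#(cond_toy, y_toy and the other names are hypothetical). It ignores the voom precision weights and the eBayes moderation applied in the analysis,
#so it only illustrates the point estimate.

```r
# Toy response for one gene across four samples (values are made up)
cond_toy <- factor(c("young", "old", "young", "old"))
y_toy <- c(2, 5, 4, 7)

# No-intercept design: one indicator column per group
design_toy <- model.matrix(~ 0 + cond_toy)
colnames(design_toy) <- gsub("cond_toy", "", colnames(design_toy))

# Ordinary least-squares coefficients are the per-group means
coefs <- lm.fit(design_toy, y_toy)$coefficients

# The contrast old - young is a weighted sum of those coefficients
contrast_toy <- c(old = 1, young = -1)
log_fc <- sum(contrast_toy[names(coefs)] * coefs)
```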
#summary
summary(decideTests(fite, method = "separate"))
## con1
## Down 0
## NotSig 182
## Up 0
#Results without adjustment for multiple comparisons (not advisable)
summary(decideTests(fite, adjust.method = "none", method = "separate"))
## con1
## Down 4
## NotSig 173
## Up 5
#global model
top.table <- topTable(fite, number = Inf, adjust = "fdr")
#Now study how p-values behave. Under the null hypothesis, p-values are expected to have a uniform distribution.
hist(top.table$P.Value, breaks = 100, main = "results P")
#No significant results were obtained at FDR < 0.05, and the distribution of p-values
#suggests that there is some variability that was not considered in the model.
#Other colData (MultiAssayExperiment) variables are not added to the model here to see whether results improve.
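#A minimal sketch of why unadjusted hits can disappear after FDR control, using made-up p-values (p_raw is hypothetical and is not taken from top.table).
#p.adjust() with method "BH" is the Benjamini-Hochberg procedure, for which "fdr" is an alias.

```r
# Hypothetical raw p-values for 10 tests (made up for illustration)
p_raw <- c(0.001, 0.008, 0.02, 0.03, 0.04, 0.20, 0.35, 0.50, 0.70, 0.90)

# Benjamini-Hochberg adjustment
p_bh <- p.adjust(p_raw, method = "BH")

n_raw <- sum(p_raw < 0.05)  # hits without any adjustment
n_bh  <- sum(p_bh  < 0.05)  # hits at FDR < 0.05

# Under the null hypothesis p-values are uniform, so a flat histogram is expected
p_null <- runif(1000)
h_null <- hist(p_null, breaks = 20, plot = FALSE)
```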
#DESeq2 on SUMMARIZED EXPERIMENT:
#As input, the DESeq2 package expects raw count data in the form of a matrix of integer values.
#The DESeq2 model internally corrects for library size, so transformed or normalized values such as counts
#scaled by library size should not be used as input. The estimates of dispersion and logarithmic fold changes incorporate data-driven prior distributions.
#ddsSE <- DESeqDataSet(mACC.exp3, design = ~ colnames(mACC.exp3))
#ddsSE
#filtering
#keep <- rowSums(counts(ddsSE) >= 10) >= 5
#ddsSE <- ddsSE[keep,]
sum_na<-sum(is.na(countsF))
#DESeq2 on COUNT MATRIX:
#Filtering is also advised by DESeq2, so we will create the DESeqDataSet from the filtered counts matrix.
countsF_int<-countsF
object.size(countsF_int)
## 27640 bytes
mode(countsF_int) <- "integer"
object.size(countsF_int)
## 20360 bytes
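#A short caveat sketch (toy values): mode(x) <- "integer" truncates towards zero instead of rounding, so when the
#filtered counts are not whole numbers it may be preferable to round them first. toy.counts and toy.trunc are hypothetical names.
toy.counts <- c(1.2, 2.7, 3.5)
toy.trunc <- toy.counts
mode(toy.trunc) <- "integer"
toy.trunc #truncated values
round(toy.counts) #rounded values, for comparison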
dds <- DESeqDataSetFromMatrix(countData = countsF_int,colData = phenoN,design = ~ age.status)
#To benefit from the default settings of the package, you should put the variable of interest at
#the end of the formula and make sure the control level is the first level. This is not necessary if contrast option is used as here
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# Global model
resG <- results(dds, alpha=0.05) #lfcThreshold is by default 0
summary(resG)
##
## out of 182 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 0, 0%
## LFC < 0 (down) : 0, 0%
## outliers [1] : 2, 1.1%
## low counts [2] : 0, 0%
## (mean count < 25)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
#Contrast of interest: old vs. young (results() uses its default alpha = 0.1 here)
res1 <- results(dds, contrast=c("age.status","old","young"))
summary(res1)
##
## out of 182 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 3, 1.6%
## LFC < 0 (down) : 2, 1.1%
## outliers [1] : 2, 1.1%
## low counts [2] : 0, 0%
## (mean count < 25)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
res1DF <- as.data.frame(res1)
res1DFS <- res1DF[order(res1DF$pvalue),]
res1DFSign <- res1DFS[!is.na(res1DFS$pvalue) & res1DFS$pvalue<0.05, ]
res1DFSign
## baseMean log2FoldChange lfcSE stat pvalue padj
## ITGA2 1327.6951 2.7360301 0.7603356 3.598451 0.0003201186 0.05153136
## TGM2 2026.5741 1.7482773 0.5176150 3.377563 0.0007313117 0.05153136
## CDKN2A 489.3470 -2.1809438 0.6543299 -3.333095 0.0008588560 0.05153136
## NRAS 642.2827 -1.4122025 0.4652978 -3.035051 0.0024049521 0.09536657
## ASNS 541.1699 1.3216405 0.4397008 3.005772 0.0026490714 0.09536657
## EGFR 255.6320 2.2555207 0.8678233 2.599055 0.0093480731 0.24120018
## XBP1 2126.5317 1.2390244 0.4769359 2.597884 0.0093800071 0.24120018
## SYK 283.9256 2.8631917 1.1340265 2.524801 0.0115763723 0.25844433
## ADAR 7898.8241 1.7509197 0.7043387 2.485906 0.0129222165 0.25844433
## TSC2 1454.1945 0.4284242 0.1777503 2.410259 0.0159412085 0.27142621
## MAPK9 1224.1549 0.8624657 0.3600007 2.395733 0.0165871575 0.27142621
## SHC1 4507.9093 1.4601221 0.6344295 2.301473 0.0213649360 0.31025225
## RAD50 1365.3078 1.0988116 0.4854012 2.263718 0.0235914522 0.31025225
## FASN 6331.0574 -1.8195275 0.8121507 -2.240382 0.0250661595 0.31025225
## SERPINE1 1642.4591 1.8487470 0.8296326 2.228392 0.0258543541 0.31025225
## PIK3R1 1677.0962 1.9490034 0.9172863 2.124749 0.0336075470 0.37808490
## AKT1S1 2463.3840 -0.9591121 0.4677636 -2.050421 0.0403234056 0.42695371
## YBX1 5622.6716 -0.9009675 0.4514227 -1.995840 0.0459513352 0.43659251
#From the DESeq2 model, at FDR < 0.1 (no gene reaches FDR < 0.05), 3 genes are overexpressed in old vs. young patients (ITGA2, TGM2, ASNS)
#and 2 genes are underexpressed (CDKN2A, NRAS):
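#A sketch reproducing the padj column of the table above with base R, under the assumption that DESeq2 adjusted over
#180 tests (182 genes minus the 2 flagged as outliers). Only the 5 smallest p-values are typed in; this matches the full
#Benjamini-Hochberg computation here because the omitted, larger p-values do not lower any of these adjusted values.
#p.top5 and padj.top5 are hypothetical names.
p.top5 <- c(ITGA2 = 0.0003201186, TGM2 = 0.0007313117, CDKN2A = 0.0008588560, NRAS = 0.0024049521, ASNS = 0.0026490714)
padj.top5 <- p.adjust(p.top5, method = "BH", n = 180) #p * n / rank, then made monotone
round(padj.top5, 5)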
#Results of limma and DESeq2 can be visualized using volcano plots and heatmaps.
#We will just create plots for the first contrast.
#Volcano plot
colorS <- c("DN" = "blue", "n.s" = "grey", "UP" = "red") #named so each class keeps its color even if a class is absent
#CHECK p or p.adj
#specific parameters
showGenes <- 20 #genes to be displayed with names
dataV <- topTable(fite, n = Inf, coef = "con1", adjust = "fdr")
dataV <- dataV %>% mutate(gene = rownames(dataV), logp = -(log10(P.Value)), logadjp = -(log10(adj.P.Val)),
FC = ifelse(logFC>0, 2^logFC, -(2^abs(logFC)))) %>%
mutate(sig = ifelse(P.Value<0.01 & logFC > 1, "UP", ifelse(P.Value<0.01 & logFC < (-1), "DN","n.s"))) #ideally we should have an adj.P.Val < 0.05
p <- ggplot(data=dataV, aes(x=logFC, y=logp )) +
geom_point(alpha = 1, size= 1, aes(col = sig)) +
scale_color_manual(values = colorS) +
xlab(expression("log"[2]*"FC")) + ylab(expression("-log"[10]*"(p.val)")) + labs(col=" ") +
geom_vline(xintercept = 1, linetype= "dotted") + geom_vline(xintercept = -1, linetype= "dotted") +
geom_hline(yintercept = -log10(0.01), linetype= "dotted") + theme_bw() #same P.Value cutoff as used for sig
p <- p + geom_text_repel(data = head(dataV[dataV$sig != "n.s",],showGenes), aes(label = gene))
print(p)
#Evidently, expression of gene CDKN2A is downregulated and expression of gene TGM2 is upregulated in old relative to young patients
#(contrast old - young; unadjusted P < 0.01 and |log2FC| > 1, not significant after FDR adjustment)
#Heatmap
#Plotting heatmap results for the limma model (without adjusting for variable patientID).
t1 <- topTable(fite, n = Inf, coef = "con1", adjust = "fdr")
res1 <- t1[t1$P.Value<0.01 & abs(t1$logFC) > 1,]
data.clus <- countsTMM[rownames(res1),]
cond2.df <- as.data.frame(cond2)
rownames(cond2.df) <- colnames(data.clus)
pheatmap(data.clus, scale = "row", show_rownames = TRUE, annotation_col = cond2.df)
#Evidently, TGM2 is overexpressed in old patients and underexpressed in young patients.
#CDKN2A is overexpressed in young patients. CDKN2A is aberrantly downregulated in old patient A5LC.
#GENE ANNOTATION AND GENE ONTOLOGY:
#Load the annotation and enrichment libraries
library(org.Hs.eg.db)
library(GOstats)
#The central ID for org.Hs.eg.db, a genome-wide annotation for humans based on Entrez Gene, is the NCBI Gene ID.
#org.Hs.egACCNUM is an R object that contains mappings between Entrez Gene identifiers and
#GenBank accession numbers.
#Define list of 5 genes of interest (DE genes - EntrezGene IDs)
gene_entrez1<-countsFInfo[countsFInfo$ID == rownames(res1DFSign)[1],16]#OVER
gene_entrez2<-countsFInfo[countsFInfo$ID == rownames(res1DFSign)[2],16]#OVER
gene_entrez3<-countsFInfo[countsFInfo$ID == rownames(res1DFSign)[3],16]#UNDER
gene_entrez4<-countsFInfo[countsFInfo$ID == rownames(res1DFSign)[4],16]#UNDER
gene_entrez5<-countsFInfo[countsFInfo$ID == rownames(res1DFSign)[5],16]#OVER
gene_entrez_total_OVER<-as.character(c(gene_entrez1,gene_entrez2,gene_entrez5))
gene_entrez_total_OVER
## [1] "3673" "7052" "440"
gene_entrez_total_UNDER<-as.character(c(gene_entrez3,gene_entrez4))
gene_entrez_total_UNDER
## [1] "1029" "4893"
# Define the universe as all the BioMart-obtained ENTREZ GENE IDs corresponding to our non-duplicated mRNA genes
universeids <- as.character(countsFInfo[,16])
length(universeids)
## [1] 181
#Before running the hypergeometric test with the hyperGTest function, we need to define the parameters
#for the test (gene lists, ontology, test direction) as well as the annotation database to be used.
#The ontology to be tested can be any of the three GO domains: biological process (“BP”), cellular component (“CC”) or molecular function (“MF”).
#We will test for over-represented biological processes in our list of differentially expressed genes.
# define the p-value cut off for the hypergeometric test
hgCutoff <- 0.05
#Conducting test for overexpressed genes
params_over <- new("GOHyperGParams",annotation="org.Hs.eg",geneIds=gene_entrez_total_OVER ,universeGeneIds=universeids,ontology="BP",pvalueCutoff=hgCutoff,testDirection="over")
#Run the test
hg_over <- hyperGTest(params_over)
# Check results
hg_over
## Gene to GO BP test for over-representation
## 425 GO BP ids tested (89 have p < 0.05)
## Selected gene set size: 3
## Gene universe size: 181
## Annotation package: org.Hs.eg
#Build the output table of significant GO terms. FDR-adjusted p-values are computed for reference,
#but the GO terms are selected below on the unadjusted p-values.
#Get the p-values of the test
hg.pv_over <- pvalues(hg_over)
#Adjust p-values for multiple testing (FDR)
hg.pv.fdr_over <- p.adjust(hg.pv_over,'fdr')
#Alternative (not used): select the GO terms with adjusted p-value less than the cut off
#sigGO.ID_over <- names(hg.pv.fdr_over[hg.pv.fdr_over < hgCutoff])
#select the GO terms with NON-adjusted p-value less than the cut off
sigGO.ID_over <- names(hg.pv_over[pvalues(hg_over) < hgCutoff])
length(sigGO.ID_over)
## [1] 89
#get table from HyperG test result
df_over <- summary(hg_over)
# keep only significant GO terms in the table
GOannot.table_over <- df_over[df_over[,1] %in% sigGO.ID_over,]
head(GOannot.table_over)
## GOBPID Pvalue OddsRatio ExpCount Count Size
## 1 GO:0050764 0.0005504285 354.00000 0.04972376 2 3
## 2 GO:0030100 0.0064569894 48.85714 0.14917127 2 9
## 3 GO:0006909 0.0137761454 30.36364 0.21546961 2 13
## 4 GO:0006528 0.0165745856 Inf 0.01657459 1 1
## 5 GO:0006529 0.0165745856 Inf 0.01657459 1 1
## 6 GO:0006541 0.0165745856 Inf 0.01657459 1 1
## Term
## 1 regulation of phagocytosis
## 2 regulation of endocytosis
## 3 phagocytosis
## 4 asparagine metabolic process
## 5 asparagine biosynthetic process
## 6 glutamine metabolic process
#Evidently, our differentially overexpressed protein-coding genes (ITGA2, TGM2, ASNS) are
#associated with phago- and endocytosis and asparagine-glutamine metabolic processes
#(unadjusted p < 0.05, based on a gene set of only 3 genes)
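#A sketch checking the first row of the table above with base R: the over-representation p-value is the hypergeometric
#probability of finding at least 2 of the 3 selected genes among the 3 universe genes annotated to GO:0050764
#(universe of 181 genes). The names p.go and exp.go are hypothetical.
p.go <- phyper(2 - 1, m = 3, n = 181 - 3, k = 3, lower.tail = FALSE) #P(X >= 2)
p.go
exp.go <- 3 * 3 / 181 #expected count: selected genes * term size / universe size
exp.go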
#VISUALIZATION OF mRNA-Seq Gene Expression
#SUBSET LIST OF ANNOTATED mRNA GENES THAT ARE SIGNIFICANTLY DGE BETWEEN OLD AND YOUNG PATIENTS WITH CORRESPONDING GENE POSITION COORDINATES AND CHROMOSOMES:
countsFInfo_sig<-countsFInfo[countsFInfo$ID %in% rownames(res1DFSign),]
countsFInfo_sig<-countsFInfo_sig[,c("ID", "chromosome_name", "start_position", "end_position")]
countsFInfo_sig
## ID chromosome_name start_position end_position
## 8 AKT1S1 19 49869033 49878459
## 17 ASNS 7 97851677 97872542
## 26 NRAS 1 114704469 114716771
## 51 MAPK9 5 180233143 180292099
## 55 ITGA2 5 52989340 53094779
## 59 ADAR 1 154582057 154628013
## 61 EGFR 7 55019017 55211628
## 63 FASN 17 82078338 82098294
## 66 SERPINE1 7 101127104 101139247
## 77 TSC2 16 2047967 2089491
## 80 YBX1 1 42682418 42703805
## 89 SHC1 1 154962298 154974395
## 106 TGM2 20 38127385 38166578
## 122 RAD50 5 132556019 132646349
## 128 PIK3R1 5 68215756 68301821
## 131 XBP1 22 28794555 28800597
## 182 SYK 9 90801787 90898549
## 192 CDKN2A 9 21967752 21995301
#The chromosomal positions listed above were further confirmed on the NCBI website; the only possible discrepancy
#is the end position of SHC1: 154974395 (154974376?).
#Chromosomes 1 (NRAS, ADAR, YBX1, SHC1), 5 (MAPK9, ITGA2, RAD50, PIK3R1), and 7 (ASNS, EGFR, SERPINE1) carry the most DGE genes; chromosome 9 carries two (SYK, CDKN2A).
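#A small sketch tallying the DGE genes per chromosome, with the chromosome names typed in by hand from the
#countsFInfo_sig table above (chr.dge is a hypothetical name):
chr.dge <- c(AKT1S1 = "19", ASNS = "7", NRAS = "1", MAPK9 = "5", ITGA2 = "5", ADAR = "1", EGFR = "7", FASN = "17", SERPINE1 = "7",
TSC2 = "16", YBX1 = "1", SHC1 = "1", TGM2 = "20", RAD50 = "5", PIK3R1 = "5", XBP1 = "22", SYK = "9", CDKN2A = "9")
sort(table(chr.dge), decreasing = TRUE)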
#GVIZ VISUALIZATION OF mRNA-Seq Gene Expression for CDKN2A gene on chromosome 9:
#Gviz displays information for a genomic region on a specific chromosome. It works with tracks, which need to be defined first.
#The virtual parent class for all track items in the package is the GdObject class. This class definition contains all the common
#entities that are needed for a track to be plotted.
#There are constructor functions for each track as well as a broad range of methods to interact with and to plot them.
#Once the tracks are defined, we can use the function plotTracks() to plot them. We introduce the basic tracks below.
mRNA_expr<-miniACC.assays.comp.age.cnvcalls.ranges[[2]]
rowRanges(mRNA_expr)
## GRanges object with 195 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## DIRAS3 1 68511645-68516481 - | 9077
## MAPK14 6 35995454-36079013 + | 1432
## YAP1 11 101981192-102104154 + | 10413
## CDKN1B 12 12870302-12875305 + | 1027
## ERBB2 17 37844393-37884915 + | 2064
## ... ... ... ... . ...
## MACC1 7 20174279-20257013 - | 346389
## CHGA 14 93389445-93401638 + | 1113
## IDH3A 15 78441719-78462884 + | 3419
## SQSTM1 5 179233388-179265077 + | 8878
## KCNJ13 2 233630512-233641275 - | 3769
## -------
## seqinfo: 25 sequences (1 circular) from 2 genomes (GRCh37.p13, hg19)
#Already a GRanges Object
#mRNA_expr.gr<-unlist(rowRanges(mRNA_expr))#from a GRangesList to a GRanges object?
mRNA_expr.gr<-rowRanges(mRNA_expr)
table(seqnames(mRNA_expr.gr))
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 22 11 13 7 9 5 9 8 9 8 10 11 2 2 7 7
## 17 18 19 20 21 22 X Y chrM
## 13 4 16 8 2 6 6 0 0
#Although no genes map to the Y chromosome, the gender of the patients is confirmed as follows:
colData(miniACC.assays.comp.age)$gender
## [1] "female" "female" "female" "male" "female" "female" "female" "female"
## [9] "male" "female"
mRNA_expr.9<-mRNA_expr.gr[seqnames(mRNA_expr.gr)=='9',]
mRNA_expr.9<-keepSeqlevels(mRNA_expr.9,"9") #to remove undesired levels
exprs.9<-assays(mRNA_expr)$exprs[names(mRNA_expr.9),]
head(exprs.9)
## TCGA-OR-A5J9-01A-11R-A29S-07 TCGA-OR-A5JE-01A-11R-A29S-07
## NOTCH1 558.5875 188.3484
## TSC1 948.5448 913.4615
## LCN2 0.0000 1.6968
## RPS6 6238.2972 31138.0090
## TTF1 316.1816 243.2127
## PTCH1 371.2584 169.6833
## TCGA-OR-A5JF-01A-11R-A29S-07 TCGA-OR-A5JI-01A-11R-A29S-07
## NOTCH1 381.5040 566.1425
## TSC1 1076.1657 833.4875
## LCN2 6.1782 6.4755
## RPS6 22022.5891 27255.3191
## TTF1 291.1478 214.6161
## PTCH1 311.2270 128.5846
## TCGA-OR-A5K0-01A-11R-A29S-07 TCGA-OR-A5KV-01A-11R-A29S-07
## NOTCH1 239.9577 180.1003
## TSC1 700.8457 1015.4261
## LCN2 33.2981 0.0000
## RPS6 12192.3890 37627.8442
## TTF1 320.2960 251.0605
## PTCH1 162.7907 52.4489
## TCGA-OR-A5L5-01A-11R-A29S-07 TCGA-OR-A5LC-01A-11R-A29S-07
## NOTCH1 507.9192 343.9620
## TSC1 807.7553 601.7639
## LCN2 1868.9241 378.5617
## RPS6 29292.7362 21307.3270
## TTF1 197.1600 221.1669
## PTCH1 209.7215 334.4640
## TCGA-OR-A5LE-01A-11R-A29S-07 TCGA-OR-A5LL-01A-11R-A29S-07
## NOTCH1 93.4855 136.9822
## TSC1 868.4314 2008.6735
## LCN2 2.4601 82.5482
## RPS6 15172.9482 14326.3048
## TTF1 296.4475 281.1425
## PTCH1 101.4810 145.9548
chr2 <- "chr9"
geno2 <- "hg19"
atrack2 <- AnnotationTrack(mRNA_expr.9, name = "mRNA-Seq for Gene CDKN2A")
gtrack2 <- GenomeAxisTrack()
itrack2 <- IdeogramTrack(gen = geno2, chromosome = chr2)
#We set 'from' and 'to' in plotTracks() to delimit the plotted region (20-25 Mb, which contains CDKN2A)
dtrack2 <- DataTrack(data = t(exprs.9), start=start(mRNA_expr.9), end=end(mRNA_expr.9),chromosome = chr2, genome = geno2,name = "mRNA-Seq for Gene CDKN2A")
plotTracks(list(gtrack2, atrack2, itrack2,dtrack2),from=20000000,to=25000000,type="heatmap", col="blue") #heatmap of expression values
#data(geneModels) #data.frame containing 97 genes at chromosome 7
#head(geneModels)
#str(geneModels)
#grtrack <- GeneRegionTrack(geneModels, genome = genome(mRNA_expr),chromosome = as.character(unique(seqnames(mRNA_expr))),name = "Gene Model", transcriptAnnotation = "symbol", background.title = "brown")
#head(displayPars(grtrack))
#itrack <- IdeogramTrack(genome = "hg19", chromosome = "chr7")
#We choose to set a from and a to in the plotTracks to delimitate the region
#dtrack <- DataTrack(data = t(exprs.7), start=start(mRNA_expr.7), end=end(mRNA_expr.7),chromosome = as.character(unique(seqnames(mRNA_expr))), genome = genome(mRNA_expr),name = "mRNA-Seq for Chromosome 7")
#The sequence track adds the genomic sequences of nucleotides, when needed.
#strack <- SequenceTrack(Hsapiens, chromosome = as.character(unique(seqnames(mRNA_expr))))
#delimit the region
#plotTracks(list(itrack,gtrack, atrack, grtrack,dtrack,strack), from = 26591822, to = 26591852, cex = 0.8)
#plotTracks(list(itrack,gtrack, atrack, grtrack,dtrack), from = 26591822, to = 26591852)
#plotTracks(list(itrack,gtrack, atrack, grtrack,dtrack), from = 26591822, to = 26591852,type = "histogram")
#plotTracks(list(itrack,gtrack, atrack, grtrack,dtrack), from = 26591822, to = 26591852,type = "l")
#plotTracks(list(itrack,gtrack, atrack, grtrack,dtrack), from = 26591822, to = 26591852,type = "heatmap", legend=T)
#plotTracks(list(itrack,gtrack, atrack, grtrack,dtrack), from = 26591822, to = 26591852,type = "boxplot")
#CIRCOS VISUALIZATION:
options(stringsAsFactors = FALSE) #important argument to keep control of factors, otherwise colors are lost in OmicCircos
seqinfo(mRNA_expr)
## Seqinfo object with 25 sequences (1 circular) from 2 genomes (GRCh37.p13, hg19):
## seqnames seqlengths isCircular genome
## 1 249250621 <NA> GRCh37.p13
## 2 243199373 <NA> GRCh37.p13
## 3 198022430 <NA> GRCh37.p13
## 4 191154276 <NA> GRCh37.p13
## 5 180915260 <NA> GRCh37.p13
## ... ... ... ...
## 21 48129895 <NA> GRCh37.p13
## 22 51304566 <NA> GRCh37.p13
## X 155270560 <NA> GRCh37.p13
## Y 59373566 <NA> GRCh37.p13
## chrM 16571 TRUE hg19
range(assays(mRNA_expr)$"exprs")
## [1] 0.0 206162.3
rr.df<-as.data.frame(rowRanges(mRNA_expr))
rna<-assays(mRNA_expr)$"exprs"
#filtering
SD <-apply(rna,1,sd)
cbind(quantiles <-quantile(SD, probs = seq(0, 1, 0.01)))
## [,1]
## 0% 3.342844e-01
## 1% 4.607587e-01
## 2% 6.208587e+00
## 3% 1.064034e+01
## 4% 1.513817e+01
## 5% 2.092353e+01
## 6% 2.366263e+01
## 7% 2.719731e+01
## 8% 2.995949e+01
## 9% 3.743964e+01
## 10% 4.224227e+01
## 11% 4.661433e+01
## 12% 7.394512e+01
## 13% 7.905309e+01
## 14% 9.587964e+01
## 15% 1.005815e+02
## 16% 1.170762e+02
## 17% 1.321343e+02
## 18% 1.326502e+02
## 19% 1.365523e+02
## 20% 1.730714e+02
## 21% 1.772702e+02
## 22% 1.886710e+02
## 23% 2.053126e+02
## 24% 2.211072e+02
## 25% 2.349678e+02
## 26% 2.488232e+02
## 27% 2.524245e+02
## 28% 2.606581e+02
## 29% 2.770464e+02
## 30% 3.122693e+02
## 31% 3.533687e+02
## 32% 3.639844e+02
## 33% 3.678300e+02
## 34% 3.780775e+02
## 35% 3.791259e+02
## 36% 3.820264e+02
## 37% 3.865365e+02
## 38% 3.882507e+02
## 39% 3.905082e+02
## 40% 3.941259e+02
## 41% 3.994727e+02
## 42% 4.136960e+02
## 43% 4.276093e+02
## 44% 4.440190e+02
## 45% 4.644401e+02
## 46% 4.853544e+02
## 47% 4.902076e+02
## 48% 5.038344e+02
## 49% 5.069006e+02
## 50% 5.094696e+02
## 51% 5.399463e+02
## 52% 5.487840e+02
## 53% 5.562187e+02
## 54% 5.835128e+02
## 55% 6.074455e+02
## 56% 6.229177e+02
## 57% 6.449953e+02
## 58% 6.613223e+02
## 59% 6.983339e+02
## 60% 7.195033e+02
## 61% 7.358404e+02
## 62% 7.620481e+02
## 63% 7.805220e+02
## 64% 8.047014e+02
## 65% 8.131532e+02
## 66% 8.355629e+02
## 67% 8.781700e+02
## 68% 8.843951e+02
## 69% 9.000901e+02
## 70% 9.621060e+02
## 71% 1.106660e+03
## 72% 1.169666e+03
## 73% 1.255653e+03
## 74% 1.296733e+03
## 75% 1.329923e+03
## 76% 1.407456e+03
## 77% 1.496389e+03
## 78% 1.577808e+03
## 79% 1.602427e+03
## 80% 1.655300e+03
## 81% 1.748514e+03
## 82% 1.872036e+03
## 83% 1.965576e+03
## 84% 2.149515e+03
## 85% 2.306744e+03
## 86% 2.513429e+03
## 87% 2.607063e+03
## 88% 2.911394e+03
## 89% 3.109756e+03
## 90% 3.722222e+03
## 91% 4.363940e+03
## 92% 5.108326e+03
## 93% 5.382823e+03
## 94% 5.692236e+03
## 95% 5.832842e+03
## 96% 6.880619e+03
## 97% 7.146997e+03
## 98% 9.846067e+03
## 99% 1.611046e+04
## 100% 5.167708e+04
rna.f<-rna[SD>quantiles["98%"],]
rr.df.f<-rr.df[rownames(rna.f),]
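#A toy sketch of the filter above (simulated data): keeping rows whose SD exceeds the 98% quantile retains only the
#most variable ~2% of genes, i.e. 4 of 195 rows when there are no ties. toy.rna, toy.sd and toy.keep are hypothetical names.
set.seed(1)
toy.rna <- matrix(rexp(195 * 10), nrow = 195)
toy.sd <- apply(toy.rna, 1, sd)
toy.keep <- toy.sd > quantile(toy.sd, probs = 0.98)
sum(toy.keep) #number of rows retained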
T.rr<-data.frame("chr"=rr.df.f$seqnames,"Start"=as.integer(rr.df.f$start),"End"=as.integer(rr.df.f$end),rna.f,row.names=NULL)
par(mar=c(2, 2, 2, 2))
plot(c(1,800), c(1,800), type="n", axes=FALSE, xlab="", ylab="", main="")
circos(R=380, cir="hg19", W=4, type="chr", print.chr.lab=TRUE, scale=TRUE)
circos(R=320, cir="hg19", W=50, mapping=T.rr, col.v=4, type="heatmap2", B=FALSE, cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue")
#checkout scale, consider transforming it
range(rna.f)
## [1] 2476.411 206162.330
#Perform a log2 transformation with an offset of 1 (as log(0) = -Inf)
T.rr<-data.frame("chr"=rr.df.f$seqnames,"Start"=as.integer(rr.df.f$start),"End"=as.integer(rr.df.f$end),log2(rna.f+1),row.names=NULL)
par(mar=c(2, 2, 2, 2))
plot(c(1,800), c(1,800), type="n", axes=FALSE, xlab="", ylab="", main="")
circos(R=400, cir="hg19", W=4, type="chr", print.chr.lab=TRUE, scale=TRUE)
circos(R=340, cir="hg19", W=50, mapping=T.rr, col.v=4, type="heatmap2", B=FALSE, cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue")
#GGBIO VISUALIZATION OF THE CHROMOSOME 1 REGION SPANNING DGE GENES NRAS, ADAR, AND SHC1 (YBX1 IS NOT INCLUDED IN THE SELECTED RANGE):
#Ideogram
p.ideo <- Ideogram(genome = "hg19")
## use chr1 automatically
p.ideo
data(genesymbol, package = "biovizBase")
genesymbol #GRanges object
## GRanges object with 29177 ranges and 2 metadata columns:
## seqnames ranges strand | symbol
## <Rle> <IRanges> <Rle> | <character>
## A1BG chr19 58858174-58864865 - | A1BG
## A2M chr12 9220304-9268558 - | A2M
## NAT1 chr8 18027971-18081197 + | NAT1
## NAT1 chr8 18067618-18081197 + | NAT1
## NAT1 chr8 18079177-18081197 + | NAT1
## ... ... ... ... . ...
## LOC100499405 chr12 9392599-9395645 + | LOC100499405
## LOC100499467 chr17 70399463-70588943 - | LOC100499467
## C9orf174 chr9 100069910-100139575 + | C9orf174
## LOC100499484 chr9 100000708-100059594 + | LOC100499484
## LOC100499489 chr10 22724354-22726858 - | LOC100499489
## ensembl_id
## <character>
## A1BG ENSG00000121410
## A2M ENSG00000175899
## NAT1 ENSG00000171428
## NAT1 ENSG00000171428
## NAT1 ENSG00000171428
## ... ...
## LOC100499405 <NA>
## LOC100499467 <NA>
## C9orf174 ENSG00000197816
## LOC100499484 <NA>
## LOC100499489 <NA>
## -------
## seqinfo: 45 sequences from an unspecified genome; no seqlengths
# select just some symbols
wh <- genesymbol[c("NRAS","ADAR","SHC1")]
# define the range
wh <- range(wh, ignore.strand = TRUE)
# gene model track from OrganismDb object, could also be created from
# TxDb object GRangesList object or EnsDb object
p.genes <- autoplot(Homo.sapiens, which = wh)
## Parsing transcripts...
## Parsing exons...
## Parsing cds...
## Parsing utrs...
## ------exons...
## ------cdss...
## ------introns...
## ------utr...
## aggregating...
## Done
## 'select()' returned 1:1 mapping between keys and columns
## Constructing graphics...
p.genes
## Warning: Removed 234 rows containing missing values or values outside the scale range
## (`geom_text()`).
#plot bam files, containing alignments, extracted from the biovizBase package
#bamfile <- system.file("extdata", "SRX21981997subADAR.bam", package="biovizBase")
#wh <- keepSeqlevels(wh, "chr1")
#bg <- BSgenome.Hsapiens.UCSC.hg19
#p.mis <- autoplot(bamfile, bsgenome = bg, which = wh, stat = "mismatch") #mismatches in the alignments, by nucleotide
#p.mis
#tracks() to bind previously generated plots
#gr1 <- GRanges("chr1", IRanges(114704469, 154974376))
#tks <- tracks(p.ideo, gene = p.genes, mismatch = p.mis, heights = c(2, 10,3)) + xlim(gr1)
#tks
#Another theme to plot
#tks + theme_tracks_sunset()
miRNA-Seq DATA BLOCK ANALYSIS
#Preliminary analysis of individual extracted miRNA-seq Summarized Experiment:
#microRNAs (miRNAs) are short (20-24 nt) non-coding RNAs that are involved in post-transcriptional regulation of gene expression
#in multicellular organisms by affecting both the stability and translation of mRNAs. miRNAs are transcribed by RNA polymerase II
#as part of capped and polyadenylated primary transcripts (pri-miRNAs) that can be either protein-coding or non-coding.
#The primary transcript is cleaved by the Drosha ribonuclease III enzyme to produce an approximately 70-nt stem-loop precursor miRNA (pre-miRNA),
#which is further cleaved by the cytoplasmic Dicer ribonuclease to generate the mature miRNA and antisense miRNA star (miRNA*) products.
#The mature miRNA is incorporated into an RNA-induced silencing complex (RISC), which recognizes target mRNAs through imperfect base pairing
#with the miRNA and most commonly results in translational inhibition or destabilization of the target mRNA.
#The RefSeq represents the predicted microRNA stem-loop.
#Creating a phenotype dataframe for miRNA expression:
phenoN_micro <- data.frame(sample=colnames(mACC.mir.c3),patientID=colData(miniACC.assays.comp.age)$patientID, age.status=colData(miniACC.assays.comp.age)$years_to_birth)
rownames(phenoN_micro)<-phenoN_micro$sample
countsM_micro <- as.matrix(assays(mACC.mir3)$exprs)
#The GENE IDs appear to be HGNC: Official Symbol. MIRLET7A1 provided by HGNC
#Official Full Name microRNA let-7a-1 provided by HGNC, for example.
#This would suggest that over 50% of genes are under microRNA regulation.
#https://www.ncbi.nlm.nih.gov/gene/406881
#https://www.ensembl.org/biomart/martview/bcd31ecb53c27f25ed8176ab4dfef813
sum(is.na(countsM_micro))
## [1] 0
#As part of the exploration, we plot data
boxplot(countsM_micro) #the provided values have not been log2-transformed
#The fifth and last samples appear to have outliers
boxplot(log2(countsM_micro+2))
#Check Library size
lSize_micro <- colSums(countsM_micro)
lSize_micro #all library sizes are > 1M (not scaled to exactly 1M, as per-million normalized values would be) and non-homogeneous
## TCGA-OR-A5J9-01A-11R-A29W-13 TCGA-OR-A5JE-01A-11R-A29W-13
## 4541066 5125120
## TCGA-OR-A5JF-01A-11R-A29W-13 TCGA-OR-A5JI-01A-11R-A29W-13
## 5098006 6600740
## TCGA-OR-A5K0-01A-11R-A29W-13 TCGA-OR-A5KV-01A-11R-A29W-13
## 6624927 2408786
## TCGA-OR-A5L5-01A-11R-A29W-13 TCGA-OR-A5LC-01A-11R-A29W-13
## 3018597 4371030
## TCGA-OR-A5LE-01A-11R-A29W-13 TCGA-OR-A5LL-01A-11R-A29W-13
## 9484599 6751885
#We study total of reads per sample (library size).
sampleT_micro <- apply(countsM_micro, 2, sum)/10^6
range(sampleT_micro)
## [1] 2.408786 9.484599
sampleTDF_micro <- data.frame(sample=names(sampleT_micro), total=sampleT_micro)
p <- ggplot(aes(x=sample, y=total, fill=total), data=sampleTDF_micro) + geom_bar(stat="identity")
p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ylab("")
#Evidently, samples 6 and 7 (A5KV, A5L5) have relatively few reads and sample 9 (A5LE) has the most reads
#Our "old" group and "young" group have 5 samples each (10 patients, 10 samples total)
keep_micro <- rowSums(countsM_micro > 10) >= 5 # keep miRNAs with more than 10 reads in at least 5 samples
countsF_micro <- countsM_micro[keep_micro,]
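#A toy sketch of the filtering rule above: a miRNA is kept only if more than 10 reads are observed in at least 5 samples.
#toy.mir is a hypothetical matrix; note that a value of exactly 10 does not count.
toy.mir <- rbind(kept = c(50, 20, 11, 30, 12, 0, 0, 0, 0, 0), dropped = c(500, 400, 300, 200, 10, 9, 0, 0, 0, 0))
rowSums(toy.mir > 10) >= 5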
#ensembl_reg_102<-useEnsembl(biomart = 'regulation', dataset = 'hsapiens_gene_ensembl',version = 102)
#getAnnotation(mart,featureType = c("TSS", "miRNA", "Exon", "5utr", "3utr", "ExonPlusUtr", "transcript"))
searchFilters(mart = ensembl102, pattern = "miRBase")
## name
## 32 with_mirbase
## 33 with_mirbase_trans_name
## 91 mirbase_accession
## 92 mirbase_id
## 93 mirbase_trans_name
## description
## 32 With miRBase ID(s)
## 33 With miRBase transcript name ID(s)
## 91 miRBase accession(s) [e.g. MI0000060]
## 92 miRBase ID(s) [e.g. hsa-let-7a-1]
## 93 miRBase transcript name ID(s) [e.g. hsa-mir-1253.1-201]
gensInfo_micro = getBM(c("mirbase_id","ensembl_gene_id","chromosome_name", "start_position","end_position", "entrezgene_id","hgnc_symbol","description"), filters=c("mirbase_id", "with_mirbase"), values=list(rownames(countsF_micro), TRUE), mart=ensembl102)
gensInfo_micro$length <- gensInfo_micro$end_position - gensInfo_micro$start_position
range(gensInfo_micro$length)
## [1] 51 148
#Consistent with the short length of miRNA stem-loop genes
dim(gensInfo_micro) #the row count differs from the number of filtered miRNAs: some IDs are repeated and some are missing
## [1] 302 9
table(duplicated(gensInfo_micro$mirbase_id))
##
## FALSE TRUE
## 291 11
gensInfo_micro[duplicated(gensInfo_micro$mirbase_id),]
## mirbase_id ensembl_gene_id chromosome_name start_position end_position
## 25 hsa-mir-1229 ENSG00000221394 5 179798278 179798346
## 79 hsa-mir-181c ENSG00000207613 19 13874699 13874808
## 81 hsa-mir-181d ENSG00000207585 19 13874875 13875011
## 100 hsa-mir-1976 ENSG00000238705 1 26554542 26554593
## 129 hsa-mir-23a ENSG00000207980 19 13836587 13836659
## 133 hsa-mir-24-2 ENSG00000284387 19 13836287 13836359
## 138 hsa-mir-27a ENSG00000207808 19 13836440 13836517
## 206 hsa-mir-423 ENSG00000283935 17 30117079 30117172
## 242 hsa-mir-509-2 ENSG00000208000 X 147260532 147260625
## 262 hsa-mir-598 ENSG00000207600 8 11035206 11035302
## 277 hsa-mir-675 ENSG00000284010 11 1996759 1996831
## entrezgene_id hgnc_symbol
## 25 100302156 MIR1229
## 79 406957 MIR181C
## 81 574457 MIR181D
## 100 100302190 MIR1976
## 129 407010 MIR23A
## 133 407013 MIR24-2
## 138 407018 MIR27A
## 206 494335 MIR423
## 242 574514 MIR509-1
## 262 693183 MIR598
## 277 100033819 MIR675
## description length
## 25 microRNA 1229 [Source:HGNC Symbol;Acc:HGNC:33924] 68
## 79 microRNA 181c [Source:HGNC Symbol;Acc:HGNC:31552] 109
## 81 microRNA 181d [Source:HGNC Symbol;Acc:HGNC:32089] 136
## 100 microRNA 1976 [Source:HGNC Symbol;Acc:HGNC:37064] 51
## 129 microRNA 23a [Source:HGNC Symbol;Acc:HGNC:31605] 72
## 133 microRNA 24-2 [Source:HGNC Symbol;Acc:HGNC:31608] 72
## 138 microRNA 27a [Source:HGNC Symbol;Acc:HGNC:31613] 77
## 206 microRNA 423 [Source:HGNC Symbol;Acc:HGNC:31880] 93
## 242 microRNA 509-1 [Source:HGNC Symbol;Acc:HGNC:32146] 93
## 262 microRNA 598 [Source:HGNC Symbol;Acc:HGNC:32854] 96
## 277 microRNA 675 [Source:HGNC Symbol;Acc:HGNC:33351] 72
#11 duplicated entries need to be removed
#Number of filtered miRNAs without a BioMart annotation:
length(setdiff(rownames(countsF_micro), gensInfo_micro$mirbase_id))
## [1] 24
countsFDF_micro <- data.frame(ID=rownames(countsF_micro),countsF_micro)
countsFInfo_micro <- right_join(countsFDF_micro, gensInfo_micro, by=c("ID"="mirbase_id"))
countsFInfo_micro <- countsFInfo_micro[!duplicated(countsFInfo_micro$ID),] #After having checked duplications, just keep first result
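#A toy sketch of the de-duplication step above: !duplicated() keeps the first row of every repeated ID and drops the rest.
#toy.ann and its placeholder values "E1".."E3" are hypothetical.
toy.ann <- data.frame(ID = c("hsa-mir-23a", "hsa-mir-23a", "hsa-mir-27a"), ensembl = c("E1", "E2", "E3"))
toy.ann[!duplicated(toy.ann$ID), ]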
#To compute FPKM (for paired-end reads) or RPKM (for single-end reads), we first divide by the library size and then by gene length.
#Notice that the column sums differ between samples after this normalization. We assume single-end sequencing for the short miRNA reads, hence RPKM.
#step 1: normalize for read depth and multiply by million
readD_micro <- apply(countsFInfo_micro[,2:11], 2, function(x) x / sum(x) * 10^6)
#step 2. scale by gene length and multiply by thousand
countsRPKM_micro <- readD_micro / countsFInfo_micro$length * 10^3
colSums(countsRPKM_micro)
## TCGA.OR.A5J9.01A.11R.A29W.13 TCGA.OR.A5JE.01A.11R.A29W.13
## 12246757 11596788
## TCGA.OR.A5JF.01A.11R.A29W.13 TCGA.OR.A5JI.01A.11R.A29W.13
## 12009815 12506499
## TCGA.OR.A5K0.01A.11R.A29W.13 TCGA.OR.A5KV.01A.11R.A29W.13
## 11784215 11737401
## TCGA.OR.A5L5.01A.11R.A29W.13 TCGA.OR.A5LC.01A.11R.A29W.13
## 12113583 11190660
## TCGA.OR.A5LE.01A.11R.A29W.13 TCGA.OR.A5LL.01A.11R.A29W.13
## 11303870 11656854
#To perform TPM, we first divide by the gene length and then we divide by the transformed sequencing depth.
#Check that the sum of each column after TPM normalization equals to 10^6.
sampleTF_micro <- colSums(countsFInfo_micro[,2:11])
#step 1: divide by gene length and multiply by thousand to obtain the reads per kilobase (RPK)
rpk_micro <- countsFInfo_micro[,2:11] / countsFInfo_micro$length * 10^3
#step 2: divide by sequencing depth and multiply by million
countsTPM_micro <- apply(rpk_micro, 2, function(x) x / sum(x) * 10^6)
#check totals (All equal to 1 million)
colSums(countsTPM_micro)
## TCGA.OR.A5J9.01A.11R.A29W.13 TCGA.OR.A5JE.01A.11R.A29W.13
## 1e+06 1e+06
## TCGA.OR.A5JF.01A.11R.A29W.13 TCGA.OR.A5JI.01A.11R.A29W.13
## 1e+06 1e+06
## TCGA.OR.A5K0.01A.11R.A29W.13 TCGA.OR.A5KV.01A.11R.A29W.13
## 1e+06 1e+06
## TCGA.OR.A5L5.01A.11R.A29W.13 TCGA.OR.A5LC.01A.11R.A29W.13
## 1e+06 1e+06
## TCGA.OR.A5LE.01A.11R.A29W.13 TCGA.OR.A5LL.01A.11R.A29W.13
## 1e+06 1e+06
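#A toy sketch (3 genes, 2 samples) contrasting the two normalizations above: TPM columns always sum to 10^6,
#whereas RPKM column sums generally differ between samples. All toy.* names are hypothetical.
toy.cnt <- matrix(c(10, 20, 30, 40, 10, 50), nrow = 3, dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
toy.len <- c(50, 100, 150) #gene lengths in bp
toy.rpkm <- apply(toy.cnt, 2, function(x) x / sum(x) * 10^6) / toy.len * 10^3 #depth first, then length
toy.rpk <- toy.cnt / toy.len * 10^3 #length first
toy.tpm <- apply(toy.rpk, 2, function(x) x / sum(x) * 10^6) #then depth
colSums(toy.rpkm)
colSums(toy.tpm)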
#PREPARING DATAFRAME FOR FUTURE CNV VS. miRNA-Seq VS. mRNA-Seq CORRELATION ANALYSIS AND MFA
countsF_TPM_LOG_micro<-log2(countsTPM_micro[,1:10]+2)
countsF_TPM_LOG_DF_micro<-as.data.frame(countsF_TPM_LOG_micro)
countsF_TPM_LOG_DF_micro$ID<-countsFInfo_micro$ID
countsF_TPM_LOG_DF_micro$chr<-countsFInfo_micro$chromosome_name
countsF_TPM_LOG_DF_micro$start<-countsFInfo_micro$start_position
countsF_TPM_LOG_DF_micro$end<-countsFInfo_micro$end_position
#PCA for miRNA-Seq
countsF_TPM_LOG_DF_micro_PCAMFA<-countsF_TPM_LOG_DF_micro[,1:10]
#Transpose
countsF_TPM_LOG_DF_micro_PCAMFA.t<-t(countsF_TPM_LOG_DF_micro_PCAMFA)
# assign names, we include a micexp suffix to differentiate genes from cnv or exp
colnames(countsF_TPM_LOG_DF_micro_PCAMFA.t)<-paste(countsF_TPM_LOG_DF_micro$ID,"micexp",sep=".")
#Construct data.frame to perform PCA
miexpr4pca<-data.frame(cond2,countsF_TPM_LOG_DF_micro_PCAMFA.t)
res.pca.miexpr<-PCA(miexpr4pca,quali.sup=1)
res.pca.miexpr
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 10 individuals, described by 292 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$quali.sup" "results for the supplementary categorical variables"
## 12 "$quali.sup$coord" "coord. for the supplementary categories"
## 13 "$quali.sup$v.test" "v-test of the supplementary categories"
## 14 "$call" "summary statistics"
## 15 "$call$centre" "mean of the variables"
## 16 "$call$ecart.type" "standard error of the variables"
## 17 "$call$row.w" "weights for the individuals"
## 18 "$call$col.w" "weights for the variables"
plot(res.pca.miexpr,habillage=1)
#With the exception of young patients A5J9 and A5JI and old patient A5LC, the young and old patient samples separate along dimensions 1 and 2.
#The first 2 dimensions capture 28.27% + 18.86% = 47.13% of the total variance.
#Normalization using TMM (edgeR package)
d_micro <- DGEList(counts = countsF_micro)
Norm.Factor_micro <- calcNormFactors(d_micro, method = "TMM")
countsTMM_micro <- cpm(Norm.Factor_micro, log = T)
countsTMMnoLog_micro <- cpm(Norm.Factor_micro, log = F)
#Compare the distributions obtained with the three normalizations (on the log2 scale) for the first sample.
hist(log2(countsRPKM_micro[,1]+2), xlab="log2-ratio", main="RPKM_micro")
#The log2-transformed RPKM values appear approximately normally distributed
hist(log2(countsTPM_micro[,1]+2), xlab="log2-ratio", main="TPM_micro")
#The log2-transformed TPM values appear approximately normally distributed
hist(countsTMM_micro[,1], xlab="log2-ratio", main="TMM_micro")
#The log2 TMM-normalized CPM values appear approximately normally distributed
#Sample aggregation
#To see how samples aggregate, we will perform hierarchical clustering as well as PCA.
#The purpose is to see whether samples aggregate by condition or whether there are outliers that might have biological or technical causes.
#Hierarchical clustering
x_micro<-countsTMM_micro
#Euclidean distance
clust.cor.ward_micro <- hclust(dist(t(x_micro)),method="ward.D2")
plot(clust.cor.ward_micro, main="hierarchical clustering", hang=-1,cex=0.8)
#With the exception of patient TCGA-OR-A5LC, the ward.D2 hierarchical clustering appears to partially reflect the segregation of the 5 old and 5 young patients
clust.cor.average_micro <- hclust(dist(t(x_micro)),method="average")
plot(clust.cor.average_micro, main="hierarchical clustering", hang=-1,cex=0.8)
#The average hierarchical clustering appears to partially reflect the segregation of the 5 old and 5 young patients
clust.cor.complete_micro <- hclust(dist(t(x_micro)),method="complete")
plot(clust.cor.complete_micro, main="hierarchical clustering", hang=-1,cex=0.8)
#The complete hierarchical clustering appears to partially reflect the segregation of the 5 old and 5 young patients
#Correlation-based distance
clust.cor.ward_micro <- hclust(as.dist(1-cor(x_micro)),method="ward.D2")
plot(clust.cor.ward_micro, main="hierarchical clustering", hang=-1,cex=0.8)
#The ward.D2 hierarchical clustering appears to reflect the segregation of the 5 old and 5 young patients
clust.cor.average_micro<- hclust(as.dist(1-cor(x_micro)),method="average")
plot(clust.cor.average_micro, main="hierarchical clustering", hang=-1,cex=0.8)
#The average hierarchical clustering does not appear to reflect the segregation of the 5 old and 5 young patients
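#Why the Euclidean and the correlation-based clusterings can disagree: a minimal sketch on three made-up
#samples (all ex_* names and values are hypothetical). Euclidean distance reacts to differences in level,
#whereas 1 - correlation only reacts to differences in the shape of the profile.
```r
ex_profile <- c(1, 2, 3, 4, 5)
ex_x <- cbind(a = ex_profile,        # reference sample
              b = ex_profile + 10,   # same shape as 'a', shifted upwards
              c = rev(ex_profile))   # opposite shape, same overall level as 'a'

# Euclidean distance between samples (samples must be in rows for dist)
ex_d_euc <- as.matrix(dist(t(ex_x)))
# Correlation-based distance between samples (cor works on columns)
ex_d_cor <- 1 - cor(ex_x)

# Euclidean: 'a' is closer to 'c' than to 'b'
ex_d_euc["a", "b"] > ex_d_euc["a", "c"]
# Correlation: 'a' is closer to 'b' than to 'c'
ex_d_cor["a", "b"] < ex_d_cor["a", "c"]
```
#Either matrix can be passed to hclust() through as.dist(), as done above with as.dist(1-cor(x_micro)).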
cond2<-phenoN_micro$age.status
countsF_micro_backup<-as.matrix(countsF_micro)
sum1<-sum(is.na(countsF_micro_backup))
sum1
## [1] 0
#Density plot of raw filtered read counts (log10); zero counts become -Inf on the log scale
countsSF_micro_backup_log <- log10(countsF_micro_backup)
d <- density(countsSF_micro_backup_log)
plot(d,xlim=c(1,8),main="",ylim=c(0,.45),xlab="Raw filtered read counts per gene (log10 transformation)", ylab="Density")
#Overlay one density curve per sample
for (s in seq_len(ncol(countsSF_micro_backup_log))){
  lines(density(countsSF_micro_backup_log[,s]))
}
#Box plots of raw filtered read counts after log10 transformation
countsSF_micro_backup_log <- log(countsF_micro_backup,10)
boxplot(countsSF_micro_backup_log , main="", xlab="", ylab="Raw read counts per gene (log10)",axes=FALSE)
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 4 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 6 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 7 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 8 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 9 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 10 is not drawn
axis(2)
axis(1,at=c(1:length(colnames(countsSF_micro_backup_log))),labels=colnames(countsSF_micro_backup_log),las=2,cex.axis=0.8)
#Heatmap with condition age.status as labels
colnames(countsF_micro_backup)<-phenoN_micro$age.status
#Plot heatmap
heatmap(countsF_micro_backup, col = topo.colors(50), margin=c(10,6))
#Evidently, one young patient shows higher counts than the other samples for many miRNA genes (these are raw counts, so library size may contribute)
#PCA
#Transpose the data to have variables (genes) as columns
data_for_PCA2 <- t(countsF_micro_backup)
#dist() computes the matrix of Euclidean dissimilarities between samples from the transposed data;
#cmdscale() then performs classical multidimensional scaling (MDS) on it and, with eig=TRUE, also returns the
#eigenvalues from which the proportion of explained variance is derived.
#Calculate MDS (multidimensional scaling)
mds2 <- cmdscale(dist(data_for_PCA2), k=3, eig=TRUE)
mds2$eig
## [1] 5.986174e+12 3.577930e+12 1.628514e+12 1.230555e+12 4.871337e+11
## [6] 4.195289e+11 1.086244e+11 3.114651e+10 9.893546e+09 -5.227671e-04
#Plotting this variable as a percentage will help determine how many components can explain the variability
#in your dataset and thus how many dimensions you should be looking at.
#Transform the Eigen values into percentage
eig_pc2 <- mds2$eig * 100 / sum(mds2$eig)
#Scree plot of the explained variance
barplot(eig_pc2,las=1,xlab="Dimensions", ylab="Proportion of explained variance (%)",col="darkgrey")
#Here the first 2 dimensions explain about 71% of the variability (from the eigenvalues above) and can be used for plotting.
#With Euclidean distances, classical MDS (cmdscale run with default parameters) is equivalent to a principal components analysis
#of the data matrix, and the plot function provides a scatter plot of the individuals.
#Calculate MDS
mds2 <- cmdscale(dist(data_for_PCA2)) # Performs MDS analysis
#Samples representation
plot(mds2[,1], -mds2[,2], type="n", xlab="Dimension 1", ylab="Dimension 2", main="")
text(mds2[,1], -mds2[,2], rownames(mds2), cex=0.8)
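#The link between cmdscale() and PCA used above can be checked on simulated data. A minimal sketch, assuming
#Euclidean distances and an unscaled PCA; the ex_* objects are hypothetical and unrelated to the miRNA counts.
```r
set.seed(1)
# Simulated data: 10 samples (rows) x 6 features (columns)
ex_mat <- matrix(rnorm(10 * 6), nrow = 10)

# Classical MDS on the Euclidean distances between samples
ex_mds <- cmdscale(dist(ex_mat), k = 2)
# PCA scores of the centered, unscaled data (first two components)
ex_pca <- prcomp(ex_mat, center = TRUE, scale. = FALSE)$x[, 1:2]

# The coordinates agree up to the sign of each axis, so compare absolute values
ex_maxdiff <- max(abs(abs(unname(ex_mds)) - abs(unname(ex_pca))))
ex_maxdiff
```
#The sign of an axis is arbitrary in both methods, which is why the plot above can flip dimension 2 (-mds2[,2]) without changing the interpretation.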
#library ggfortify needed for the autoplot to understand and plot PCA results
summary(pca.filt_micro <- prcomp(t(x_micro), scale=T ))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 9.1799 8.4363 6.4439 5.49967 5.13749 4.4086 4.28110
## Proportion of Variance 0.2675 0.2259 0.1318 0.09602 0.08379 0.0617 0.05818
## Cumulative Proportion 0.2675 0.4935 0.6253 0.72131 0.80510 0.8668 0.92498
## PC8 PC9 PC10
## Standard deviation 3.71113 3.1398 7.651e-15
## Proportion of Variance 0.04372 0.0313 0.000e+00
## Cumulative Proportion 0.96870 1.0000 1.000e+00
autoplot(pca.filt_micro, data=phenoN_micro, colour="patientID", shape="age.status")
#There does not appear to be segregation by age status.
#Note that a total of 26.75% + 22.59% = 49.34% of the variance is accounted for by the
#first 2 principal components PC1 and PC2 (cumulative proportion 0.4935 in the summary above)
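#How the "Proportion of Variance" row is obtained: each standard deviation is squared and divided by the
#sum of all squared standard deviations. A minimal sketch; ex_sdev is a hypothetical name holding the values
#copied by hand from the summary() output above.
```r
# Standard deviations of PC1..PC10 as printed by summary(pca.filt_micro)
ex_sdev <- c(9.1799, 8.4363, 6.4439, 5.49967, 5.13749,
             4.4086, 4.28110, 3.71113, 3.1398, 7.651e-15)

# Variance of each component as a fraction of the total variance
ex_prop <- ex_sdev^2 / sum(ex_sdev^2)

# Percentage explained by PC1 and PC2, and by both together
round(100 * ex_prop[1:2], 2)
round(100 * sum(ex_prop[1:2]), 2)
```
#With scale=TRUE every variable has unit variance, so the total variance equals the number of variables entering the PCA.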
#LIMMA-BASED Differentially Expressed miRNA genes analysis
cond2<-phenoN_micro$age.status
phenoN_micro[colnames(countsF_micro),]$age.status== cond2
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#Create design matrix for limma
design2 <- model.matrix(~0+cond2)
# remove the "cond2" prefix from the design column names
colnames(design2)<- gsub("cond2","",colnames(design2))
# check design matrix
design2
## old young
## 1 0 1
## 2 0 1
## 3 1 0
## 4 0 1
## 5 1 0
## 6 0 1
## 7 1 0
## 8 1 0
## 9 0 1
## 10 1 0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$cond2
## [1] "contr.treatment"
#calculate normalization factors between libraries
nf2 <- calcNormFactors(countsF_micro)
# normalize the read counts with 'voom' function
y2 <- voom(countsF_micro,design2,lib.size=colSums(countsF_micro)*nf2)
#Extract the Normalized read counts
counts.voom2 <- y2$E
#Fit linear model for each gene given a series of libraries
fit2 <- lmFit(y2,design2)
# construct the contrast matrix corresponding to specified contrasts of a set of parameters
cont.matrix2 <- makeContrasts(old-young,levels=design2)
cont.matrix2
## Contrasts
## Levels old - young
## old 1
## young -1
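#The contrast above is defined on a design without intercept (one column per group). A minimal sketch of how
#this relates to a design with an intercept, such as model.matrix(~ cond2): there the second coefficient is
#young minus old, i.e. the same difference with the opposite sign. The values and ex_* names are made up,
#and base R lm() stands in for lmFit() purely for illustration.
```r
# Hypothetical age groups and expression values for one gene
ex_age <- factor(c("young", "young", "old", "young", "old", "old"))
ex_y   <- c(5.1, 4.8, 6.9, 5.3, 7.2, 7.0)

# Design without intercept: one coefficient (group mean) per level
ex_fit_means <- lm(ex_y ~ 0 + ex_age)
# Design with intercept: intercept = mean of 'old', second coefficient = young - old
ex_fit_ref <- lm(ex_y ~ ex_age)

# old - young from the group means, as in makeContrasts(old-young, ...)
ex_old_minus_young <- unname(coef(ex_fit_means)["ex_ageold"] -
                             coef(ex_fit_means)["ex_ageyoung"])
# young - old from the design with intercept
ex_young_minus_old <- unname(coef(ex_fit_ref)["ex_ageyoung"])

c(old_minus_young = ex_old_minus_young, young_minus_old = ex_young_minus_old)
```
#When log fold changes from different models are compared, it is therefore worth checking which group is the reference before interpreting the sign.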
# compute estimated coefficients and standard errors for a given set of contrasts
fit2 <- contrasts.fit(fit2, cont.matrix2)
# compute moderated t-statistics of differential expression by empirical Bayes moderation of the standard errors
fit2 <- eBayes(fit2)
options(digits=3)
# check the output fit
dim(fit2)
## [1] 315 1
#Set adjusted pvalue threshold and log fold change threshold
mypval=0.01
myfc=3
#Get the coefficient name for the comparison of interest
colnames(fit2$coefficients)
## [1] "old - young"
mycoef="old - young"
# Get the output table for the 10 most significant DE genes for this comparison
topTable(fit2,coef=mycoef)
## logFC AveExpr t P.Value adj.P.Val B
## hsa-mir-542 -1.264 10.63 -2.97 0.0134 0.566 -4.53
## hsa-let-7e -1.067 12.29 -2.54 0.0286 0.566 -4.53
## hsa-mir-10a 1.167 14.49 2.18 0.0533 0.566 -4.54
## hsa-mir-28 0.696 10.95 2.25 0.0469 0.566 -4.55
## hsa-let-7f-2 -0.714 14.31 -1.95 0.0788 0.566 -4.55
## hsa-mir-483 -4.348 8.07 -2.64 0.0237 0.566 -4.55
## hsa-mir-29a 0.930 11.94 2.01 0.0707 0.566 -4.55
## hsa-mir-508 3.563 11.27 2.10 0.0606 0.566 -4.56
## hsa-mir-181a-1 -1.057 11.64 -1.99 0.0731 0.566 -4.56
## hsa-mir-98 -1.352 6.43 -2.76 0.0194 0.566 -4.56
#Get the full table ("n = number of genes in the fit")
limma.res <- topTable(fit2,coef=mycoef,n=dim(fit2)[1])
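#Get DE genes by adjusted p-value. No miRNA passes mypval (the top-ranked miRNAs all have an adjusted p-value of 0.566),
#so the adjusted p-value cutoff was relaxed to 0.57 to obtain an exploratory list; these genes are not significant after FDR correction
limma.res.pval <- topTable(fit2,coef=mycoef,n=dim(fit2)[1],p.val=0.57)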
dim(limma.res.pval)
## [1] 69 6
#From this exploratory list, keep only the genes with a large fold change (|logFC| > myfc)
limma.res.pval.FC <- limma.res.pval[which(abs(limma.res.pval$logFC)>myfc),]
dim(limma.res.pval.FC)
## [1] 19 6
limma.res.pval.FC
## logFC AveExpr t P.Value adj.P.Val B
## hsa-mir-483 -4.35 8.067 -2.64 0.0237 0.566 -4.55
## hsa-mir-508 3.56 11.273 2.10 0.0606 0.566 -4.56
## hsa-mir-509-2 3.64 7.671 2.25 0.0471 0.566 -4.56
## hsa-mir-509-1 3.61 7.685 2.24 0.0476 0.566 -4.57
## hsa-mir-509-3 3.46 8.024 2.14 0.0570 0.566 -4.57
## hsa-mir-153-2 -4.57 4.647 -2.64 0.0240 0.566 -4.57
## hsa-mir-514-3 3.61 8.133 1.96 0.0766 0.566 -4.57
## hsa-mir-514-1 3.57 8.127 1.95 0.0789 0.566 -4.57
## hsa-mir-514-2 3.58 8.101 1.92 0.0832 0.566 -4.57
## hsa-mir-511-2 3.05 0.532 2.90 0.0151 0.566 -4.57
## hsa-mir-514b 3.61 2.630 2.23 0.0489 0.566 -4.58
## hsa-mir-513c 3.54 3.795 2.02 0.0693 0.566 -4.58
## hsa-mir-506 3.29 5.270 1.83 0.0960 0.566 -4.58
## hsa-mir-513a-1 4.19 1.740 2.20 0.0515 0.566 -4.58
## hsa-mir-412 -3.76 5.697 -1.71 0.1170 0.566 -4.58
## hsa-mir-153-1 -3.89 0.441 -2.16 0.0550 0.566 -4.58
## hsa-mir-507 3.01 2.743 1.83 0.0954 0.566 -4.58
## hsa-mir-329-2 -3.04 1.730 -1.90 0.0848 0.566 -4.58
## hsa-mir-513a-2 3.44 1.763 1.69 0.1211 0.566 -4.59
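#Why so many genes share the same adj.P.Val: the Benjamini-Hochberg adjustment multiplies each sorted p-value
#by n/rank and then takes a running minimum from the largest p-value downwards, which creates ties.
#A minimal sketch on made-up p-values (ex_* names are hypothetical), checked against base R p.adjust().
```r
# Hypothetical raw p-values, already sorted in increasing order
ex_p <- c(0.01, 0.02, 0.03, 0.5, 0.9)
ex_n <- length(ex_p)

# Step 1: scale each p-value by n / rank
ex_scaled <- ex_p * ex_n / seq_len(ex_n)
# Step 2: running minimum from the largest p-value down, capped at 1
ex_manual <- pmin(1, rev(cummin(rev(ex_scaled))))

# Same result as the built-in BH adjustment
ex_builtin <- p.adjust(ex_p, method = "BH")
rbind(manual = ex_manual, builtin = ex_builtin)
```
#In this toy case the three smallest p-values all receive the same adjusted value, the same pattern as the repeated 0.566 in the table above.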
#Standard edgeR differential expression analysis
design <- model.matrix(~ cond2)
# Using trended dispersions
dge <- DGEList(counts = countsF_micro)
dge <- calcNormFactors(dge)
dge$samples$age.status <- cond2
dge <- estimateGLMCommonDisp(dge, design)
dge <- estimateGLMTrendedDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)
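# Fit GLM model for the age.status effect
fit <- glmFit(dge, design)
# glmLRT tests the last column of the design by default (here cond2young, i.e. young relative to old)
lrt <- glmLRT(fit)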
#Table of unadjusted p-values (PValue) and FDR values
p_val_DE_edgeR <- topTags(lrt, adjust.method = 'BH', n = Inf)
# Get the top 10 differentially expressed miRNAs
top_miRNAs <- rownames(p_val_DE_edgeR$table)[1:10]
top_miRNAs
## [1] "hsa-mir-153-2" "hsa-mir-153-1" "hsa-mir-541" "hsa-mir-412"
## [5] "hsa-mir-3200" "hsa-mir-675" "hsa-mir-1248" "hsa-mir-9-2"
## [9] "hsa-mir-9-1" "hsa-mir-1229"
#DESeq2 DIFFERENTIALLY EXPRESSED GENE ANALYSIS
sum_na<-sum(is.na(countsF_micro))
#DESeq2 on COUNT MATRIX:
#Filtering is also advised by DESeq2, so we will create the DESeqDataSet from the filtered counts matrix.
countsF_int_micro<-countsF_micro
object.size(countsF_int_micro)
## 49296 bytes
mode(countsF_int_micro) <- "integer"
object.size(countsF_int_micro)
## 36696 bytes
dds_micro <- DESeqDataSetFromMatrix(countData = countsF_int_micro,colData = phenoN_micro,design = ~ age.status)
#To benefit from the default settings of the package, you should put the variable of interest at
#the end of the formula and make sure the control level is the first level. This is not necessary if contrast option is used as here
dds_micro <- DESeq(dds_micro)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## -- note: fitType='parametric', but the dispersion trend was not well captured by the
## function: y = a/x + b, and a local regression fit was automatically substituted.
## specify fitType='local' or 'mean' to avoid this message next time.
## final dispersion estimates
## fitting model and testing
# Global model (default comparison: last level of age.status vs. the first, here young vs. old)
resG_micro <- results(dds_micro, alpha=0.05) #lfcThreshold is by default 0
summary(resG_micro)
##
## out of 315 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 12, 3.8%
## LFC < 0 (down) : 2, 0.63%
## outliers [1] : 11, 3.5%
## low counts [2] : 0, 0%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
#Contrast of interest: old vs. young (default alpha of 0.1)
res1_micro <- results(dds_micro, contrast=c("age.status","old","young"))
summary(res1_micro)
##
## out of 315 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 6, 1.9%
## LFC < 0 (down) : 18, 5.7%
## outliers [1] : 11, 3.5%
## low counts [2] : 0, 0%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
res1DF_micro <- as.data.frame(res1_micro)
res1DFS_micro <- res1DF_micro[order(res1DF_micro$pvalue),]
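#Keep the genes with an unadjusted p-value < 0.05 (see the padj column for the FDR-adjusted values)
res1DFSign_micro <- res1DFS_micro[!is.na(res1DFS_micro$pvalue) & res1DFS_micro$pvalue<0.05, ]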
res1DFSign_micro
## baseMean log2FoldChange lfcSE stat pvalue padj
## hsa-mir-153-2 789.7 -4.706 1.151 -4.09 4.32e-05 0.00827
## hsa-mir-3200 90.3 -3.007 0.745 -4.04 5.44e-05 0.00827
## hsa-mir-675 275.3 3.015 0.856 3.52 4.30e-04 0.02790
## hsa-mir-153-1 39.6 -4.825 1.394 -3.46 5.38e-04 0.02790
## hsa-mir-148b 854.5 -0.972 0.286 -3.40 6.79e-04 0.02790
## hsa-mir-9-2 43653.9 -3.558 1.067 -3.33 8.53e-04 0.02790
## hsa-mir-542 8920.8 -1.268 0.382 -3.32 9.14e-04 0.02790
## hsa-mir-541 327.6 -4.658 1.406 -3.31 9.26e-04 0.02790
## hsa-mir-9-1 43735.3 -3.541 1.070 -3.31 9.35e-04 0.02790
## hsa-mir-412 2806.0 -4.575 1.389 -3.29 9.93e-04 0.02790
## hsa-mir-1229 50.8 -3.170 0.964 -3.29 1.01e-03 0.02790
## hsa-mir-511-1 18.4 2.375 0.754 3.15 1.63e-03 0.03872
## hsa-mir-98 510.6 -1.435 0.456 -3.15 1.66e-03 0.03872
## hsa-mir-887 496.3 -2.218 0.714 -3.11 1.90e-03 0.04122
## hsa-mir-9-3 85.3 -3.593 1.190 -3.02 2.53e-03 0.05121
## hsa-mir-380 184.0 -3.739 1.270 -2.94 3.23e-03 0.06141
## hsa-mir-421 27.0 -1.422 0.489 -2.91 3.63e-03 0.06493
## hsa-mir-222 67.1 1.915 0.683 2.81 5.02e-03 0.08410
## hsa-mir-221 179.4 1.801 0.645 2.79 5.26e-03 0.08410
## hsa-mir-28 10061.0 0.723 0.264 2.74 6.10e-03 0.08902
## hsa-mir-103-2 87.5 -1.238 0.453 -2.74 6.24e-03 0.08902
## hsa-mir-598 549.9 -2.123 0.779 -2.72 6.44e-03 0.08902
## hsa-mir-1287 47.6 2.079 0.774 2.68 7.28e-03 0.09616
## hsa-mir-432 3174.8 -3.707 1.393 -2.66 7.79e-03 0.09873
## hsa-mir-324 633.5 -1.850 0.702 -2.63 8.42e-03 0.10239
## hsa-mir-200c 102.6 2.479 0.953 2.60 9.32e-03 0.10897
## hsa-let-7e 28565.1 -1.161 0.449 -2.58 9.75e-03 0.10973
## hsa-mir-329-2 58.7 -3.059 1.198 -2.55 1.07e-02 0.11602
## hsa-mir-339 177.1 0.693 0.274 2.52 1.16e-02 0.12180
## hsa-mir-135a-1 59.5 2.954 1.181 2.50 1.24e-02 0.12564
## hsa-mir-103-1 112148.6 -1.223 0.499 -2.45 1.44e-02 0.12881
## hsa-mir-16-1 1276.7 1.146 0.469 2.45 1.45e-02 0.12881
## hsa-mir-410 2400.5 -3.340 1.371 -2.44 1.48e-02 0.12881
## hsa-mir-3648 38.5 -2.465 1.015 -2.43 1.52e-02 0.12881
## hsa-mir-431 4569.2 -3.348 1.381 -2.42 1.53e-02 0.12881
## hsa-mir-141 24.7 2.439 1.010 2.42 1.57e-02 0.12881
## hsa-mir-889 4421.5 -3.116 1.296 -2.40 1.62e-02 0.12881
## hsa-mir-668 40.3 -3.202 1.333 -2.40 1.63e-02 0.12881
## hsa-mir-769 122.6 -1.054 0.440 -2.40 1.65e-02 0.12881
## hsa-mir-424 3299.2 1.242 0.528 2.35 1.87e-02 0.14194
## hsa-mir-10a 138399.9 1.213 0.526 2.31 2.11e-02 0.15675
## hsa-mir-301a 115.2 -1.684 0.752 -2.24 2.51e-02 0.18156
## hsa-mir-217 140.6 2.284 1.032 2.21 2.69e-02 0.18986
## hsa-mir-128-1 464.0 -0.717 0.327 -2.19 2.84e-02 0.19031
## hsa-mir-511-2 15.9 2.290 1.048 2.19 2.89e-02 0.19031
## hsa-mir-214 30.9 1.658 0.760 2.18 2.91e-02 0.19031
## hsa-mir-139 19962.0 -2.161 0.992 -2.18 2.94e-02 0.19031
## hsa-mir-758 1362.4 -2.636 1.216 -2.17 3.03e-02 0.19166
## hsa-mir-375 348.0 2.093 0.975 2.15 3.18e-02 0.19712
## hsa-mir-370 1355.1 -2.782 1.305 -2.13 3.30e-02 0.20073
## hsa-mir-497 75.9 1.586 0.750 2.11 3.46e-02 0.20292
## hsa-mir-33b 16.7 1.280 0.607 2.11 3.51e-02 0.20292
## hsa-mir-496 360.8 -2.537 1.209 -2.10 3.59e-02 0.20292
## hsa-mir-223 320.8 1.059 0.507 2.09 3.69e-02 0.20292
## hsa-mir-361 3451.4 1.127 0.540 2.09 3.70e-02 0.20292
## hsa-mir-425 1296.0 -1.273 0.612 -2.08 3.75e-02 0.20292
## hsa-mir-362 149.1 1.462 0.705 2.07 3.80e-02 0.20292
## hsa-mir-329-1 59.2 -2.819 1.373 -2.05 4.01e-02 0.21019
## hsa-mir-503 3750.5 -1.338 0.661 -2.03 4.28e-02 0.22043
## hsa-mir-433 353.8 -2.892 1.440 -2.01 4.47e-02 0.22635
## hsa-mir-382 2017.9 -2.516 1.271 -1.98 4.78e-02 0.23809
#Volcano plot
colorS <- c("blue", "grey", "red")
#CHECK p or p.adj
#specific parameters
showGenes <- 20 #genes to be displayed with names
dataV <- topTable(fit2, n = Inf, coef = mycoef, adjust = "fdr")
dataV <- dataV %>% mutate(gene = rownames(dataV), logp = -(log10(P.Value)), logadjp = -(log10(adj.P.Val)),
FC = ifelse(logFC>0, 2^logFC, -(2^abs(logFC)))) %>%
mutate(sig = ifelse(P.Value<0.01 & logFC > 1, "UP", ifelse(P.Value<0.01 & logFC < (-1), "DN","n.s"))) #ideally we should have an adj.P.Val < 0.05
p <- ggplot(data=dataV, aes(x=logFC, y=logp )) +
geom_point(alpha = 1, size= 1, aes(col = sig)) +
scale_color_manual(values = colorS) +
xlab(expression("log"[2]*"FC")) + ylab(expression("-log"[10]*"(p.val)")) + labs(col=" ") +
geom_vline(xintercept = 1, linetype= "dotted") + geom_vline(xintercept = -1, linetype= "dotted") +
geom_hline(yintercept = -log10(0.01), linetype= "dotted") + theme_bw() #same p-value cutoff as used for sig
p <- p + geom_text_repel(data = head(dataV[dataV$sig != "n.s",],showGenes), aes(label = gene))
print(p)
#Based on this first limma-based DEG model, miRNA genes hsa-mir-511-1 and hsa-mir-675 are upregulated in old relative to young patients
#(unadjusted P < 0.01 and logFC > 1, the thresholds used for the volcano plot)
#Heatmap
#Plotting heatmap results for the limma model (without adjusting for variable patientID).
t1 <- topTable(fit2, n = Inf, coef = mycoef, adjust = "fdr")
res1 <- t1[t1$P.Value<0.01 & abs(t1$logFC) > 1,]
data.clus <- countsTMM_micro[rownames(res1),]
cond2.df <- as.data.frame(cond2)
rownames(cond2.df) <- colnames(data.clus)
pheatmap(data.clus, scale = "row", show_rownames = TRUE, annotation_col = cond2.df)
#Evidently, miRNA gene hsa-mir-511-1 is overexpressed in old patients A5LL and A5L5 and underexpressed in young patients
#A5LE, A5J9, and A5KV.
#On the other hand, miRNA gene hsa-mir-675 is underexpressed in young patients A5J9, A5JI, A5JE, and A5KV and in old patient A5K0,
#and overexpressed in A5LL, A5JF, and, slightly, in A5LC and A5L5.
#GENE ANNOTATION AND GENE ONTOLOGY FOR DIFFERENTIALLY OVEREXPRESSED miRNA GENES
#Load the library
#The central ID for org.Hs.eg.db, a genome-wide annotation for humans based on Entrez Gene, is the NCBI Gene ID.
#org.Hs.egACCNUM is an R object that contains mappings between Entrez Gene identifiers and
#GenBank accession numbers.
# Define list of genes of interest (DE genes - miRBase IDs)
mirbase_ids <- as.character(rownames(limma.res.pval.FC))
length(mirbase_ids)
## [1] 19
#We explore gene ontology for 2 selected miRNA genes that are differentially expressed or show a high log fold change,
#and obtain their ENTREZ gene IDs for GOstats
genes_mirbase <- c(mirbase_ids[1], rownames(dataV)[11])
genes_ensembl1<-countsFInfo_micro[countsFInfo_micro$ID == genes_mirbase[1],12]
genes_ensembl2<-countsFInfo_micro[countsFInfo_micro$ID == genes_mirbase[2],12]
#genes_ensembl3<-countsFInfo_micro[countsFInfo_micro$ID == "hsa-mir-511-1",12]
genes_ensembl<-c(genes_ensembl1,genes_ensembl2)
genes_ensembl
## [1] "ENSG00000207805" "ENSG00000288367"
mapIds(org.Hs.eg.db,keys = genes_ensembl,column = 'ENTREZID',keytype = 'ENSEMBL')
## 'select()' returned 1:1 mapping between keys and columns
## ENSG00000207805 ENSG00000288367
## "619552" "100033819"
select(org.Hs.eg.db,keys = genes_ensembl,column = c('SYMBOL', 'ENTREZID', 'ENSEMBL'),keytype = 'ENSEMBL')
## 'select()' returned 1:1 mapping between keys and columns
## ENSEMBL SYMBOL ENTREZID
## 1 ENSG00000207805 MIR483 619552
## 2 ENSG00000288367 MIR675 100033819
genes_entrez<-c("619552","100033819")
#Define the universe as all the BioMart-obtained ENTREZ GENE IDs corresponding to our non-duplicated miRNA genes
universeids <- as.character(countsFInfo_micro[,16])
length(universeids)
## [1] 291
#Before running the hypergeometric test with the hyperGTest function, we need to define the parameters
#for the test (gene lists, ontology, test direction) as well as the annotation database to be used.
#The ontology to be tested can be any of the three GO domains: biological process (“BP”), cellular component (“CC”) or molecular function (“MF”).
#We will test for over-represented biological processes in our list of differentially expressed genes.
# define the p-value cut off for the hypergeometric test
hgCutoff <- 0.05
params <- new("GOHyperGParams",annotation="org.Hs.eg",geneIds=genes_entrez,universeGeneIds=universeids,ontology="BP",pvalueCutoff=hgCutoff,testDirection="over")
## Warning in makeValidParams(.Object): removing duplicate IDs in universeGeneIds
#Run the test
hg <- hyperGTest(params)
#Check results
hg
## Gene to GO BP test for over-representation
## 326 GO BP ids tested (68 have p < 0.05)
## Selected gene set size: 2
## Gene universe size: 257
## Annotation package: org.Hs.eg
#We can get the output table from the test for significant GO terms only by adjusting the pvalues with the p.adjust function.
#Get the p-values of the test
hg.pv <- pvalues(hg)
#Adjust p-values for multiple test (FDR)
hg.pv.fdr <- p.adjust(hg.pv,'fdr')
#select the GO terms with adjusted p-value less than the cut off
#sigGO.ID <- names(hg.pv.fdr[hg.pv.fdr < hgCutoff])
#select the GO terms with NON-adjusted p-value less than the cut off
sigGO.ID <- names(hg.pv[pvalues(hg) < hgCutoff])
length(sigGO.ID)
## [1] 68
#Get table from HyperG test result
df <- summary(hg)
#Keep only significant GO terms in the table
GOannot.table <- df[df[,1] %in% sigGO.ID,]
head(GOannot.table)
## GOBPID Pvalue OddsRatio ExpCount Count Size
## 1 GO:0010563 0.00201 Inf 0.0934 2 12
## 2 GO:0045936 0.00201 Inf 0.0934 2 12
## 3 GO:0006793 0.00638 Inf 0.1634 2 21
## 4 GO:0006796 0.00638 Inf 0.1634 2 21
## 5 GO:0019220 0.00638 Inf 0.1634 2 21
## 6 GO:0051174 0.00638 Inf 0.1634 2 21
## Term
## 1 negative regulation of phosphorus metabolic process
## 2 negative regulation of phosphate metabolic process
## 3 phosphorus metabolic process
## 4 phosphate-containing compound metabolic process
## 5 regulation of phosphate metabolic process
## 6 regulation of phosphorus metabolic process
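#Evidently, our 2 selected differentially expressed miRNA genes are associated with regulation of phosphorus metabolism.
#With a selected set of only 2 genes and unadjusted p-values, this enrichment should be read as exploratory.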
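#The p-value in the first row of the table can be reproduced by hand with the hypergeometric distribution.
#A minimal sketch using the numbers printed above (universe 257, selected set 2, term size 12, count 2);
#the ex_* names are hypothetical, and base R phyper() is assumed to mirror what the test computes.
```r
ex_universe <- 257  # gene universe size reported by hyperGTest
ex_selected <- 2    # selected gene set size
ex_term     <- 12   # universe genes annotated to GO:0010563 (Size)
ex_hits     <- 2    # selected genes annotated to the term (Count)

# P(X >= ex_hits) when drawing ex_selected genes without replacement
ex_pval <- phyper(ex_hits - 1, ex_term, ex_universe - ex_term,
                  ex_selected, lower.tail = FALSE)
# Expected number of hits under the null (ExpCount)
ex_expected <- ex_selected * ex_term / ex_universe

signif(c(Pvalue = ex_pval, ExpCount = ex_expected), 3)
```
#Any term that happens to contain both selected genes obtains a small p-value in this setting, which is why 68 terms fall below 0.05.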
#The R package multiMiR, with web server at http://multimir.org, is a comprehensive collection of predicted and validated miRNA-target
#interactions and their associations with diseases and drugs.
#Retrieving validated miRNA-target gene interactions yields ~11,000 target genes, suggesting that over 50% of human genes are under microRNA regulation.
vers_table <- multimir_dbInfoVersions()
vers_table
## VERSION UPDATED RDA DBNAME
## 1 2.3.0 2020-04-15 multimir_cutoffs_2.3.rda multimir2_3
## 2 2.2.0 2017-08-08 multimir_cutoffs_2.2.rda multimir2_2
## 3 2.1.0 2016-12-22 multimir_cutoffs_2.1.rda multimir2_1
## 4 2.0.0 2015-05-01 multimir_cutoffs.rda multimir
## SCHEMA PUBLIC TABLES
## 1 multiMiR_DB_schema.sql 1 multiMiR_dbTables.txt
## 2 multiMiR_DB_schema.sql 1 multiMiR_dbTables.txt
## 3 multiMiR_DB_schema.sql 1 multiMiR_dbTables.txt
## 4 multiMiR_DB_schema.sql 1 multiMiR_dbTables.txt
curr_vers <- vers_table[1, "VERSION"] # current version
multimir_switchDBVersion(db_version = curr_vers)
## Now using database version: 2.3.0
#The function multimir_dbInfo() will display information about the external miRNA and miRNA-target databases in multiMiR,
#including version, release date, link to download the data, and the corresponding table in multiMiR.
db.info = multimir_dbInfo()
db.info
## map_name source_name source_version source_date
## 1 diana_microt DIANA-microT 5 Sept, 2013
## 2 elmmo EIMMo 5 Jan, 2011
## 3 microcosm MicroCosm 5 Sept, 2009
## 4 mir2disease miR2Disease Mar 14, 2011
## 5 miranda miRanda Aug, 2010
## 6 mirdb miRDB 6 June, 2019
## 7 mirecords miRecords 4 Apr 27, 2013
## 8 mirtarbase miRTarBase 7.0 Sept, 2017
## 9 pharmaco_mir Pharmaco-miR (Verified Sets)
## 10 phenomir PhenomiR 2 Feb 15, 2011
## 11 pictar PicTar 2 Dec 21, 2012
## 12 pita PITA 6 Aug 31, 2008
## 13 tarbase TarBase 8 2018
## 14 targetscan TargetScan 7.2 March, 2018
## source_url
## 1 http://diana.imis.athena-innovation.gr/DianaTools/index.php?r=microT_CDS/index
## 2 http://www.mirz.unibas.ch/miRNAtargetPredictionBulk.php
## 3 http://www.ebi.ac.uk/enright-srv/microcosm/cgi-bin/targets/v5/download.pl
## 4 http://www.mir2disease.org
## 5 http://www.microrna.org/microrna/getDownloads.do
## 6 http://mirdb.org
## 7 http://mirecords.biolead.org/download.php
## 8 http://mirtarbase.mbc.nctu.edu.tw/php/download.php
## 9 http://www.pharmaco-mir.org/home/download_VERSE_db
## 10 http://mips.helmholtz-muenchen.de/phenomir/
## 11 http://dorina.mdc-berlin.de
## 12 http://genie.weizmann.ac.il/pubs/mir07/mir07_data.html
## 13 http://carolina.imis.athena-innovation.gr/diana_tools/web/index.php?r=tarbasev8%2Findex
## 14 http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_61
#Among the 14 external databases, eight contain predicted miRNA-target interactions (DIANA-microT-CDS, ElMMo, MicroCosm, miRanda, miRDB, PicTar, PITA, and TargetScan),
#three have experimentally validated miRNA-target interactions (miRecords, miRTarBase, and TarBase) and the remaining three contain miRNA-drug/disease associations
#(miR2Disease, Pharmaco-miR, and PhenomiR). To check these categories and databases from within R, multiMiR provides a set of helper functions, two of which are called here:
predicted_tables()
## [1] "diana_microt" "elmmo" "microcosm" "miranda" "mirdb"
## [6] "pictar" "pita" "targetscan"
validated_tables()
## [1] "mirecords" "mirtarbase" "tarbase"
#get_multimir() is the main function in the package to retrieve predicted and validated miRNA-target
#interactions and their disease and drug associations from the multiMiR database.
#Plug miRNAs into multiMiR to get their validated targets
#multimir_target_results <- get_multimir(org = 'mmu', mirna = "hsa-mir-382", table = 'predicted', summary = TRUE)
#Retrieve all gene targets of miRNA gene hsa-miR-107 and of the miRNA genes previously determined to be
#differentially expressed by age.status in our dataframe, taken from the combined limma, DESeq2, and edgeR results:
#"hsa-mir-153-2", "hsa-mir-153-1", "hsa-mir-541", "hsa-mir-412", "hsa-mir-3200",
#"hsa-mir-675", "hsa-mir-1248", "hsa-mir-9-2", "hsa-mir-9-1", "hsa-mir-1229", "hsa-mir-511-1", "hsa-mir-507", "hsa-mir-107",
#"hsa-mir-148b", "hsa-mir-542", "hsa-mir-98", "hsa-mir-887", "hsa-mir-9-3"
example1 <- get_multimir(mirna = countsFInfo_micro[18,1] , summary = TRUE)
## Searching mirecords ...
## Searching mirtarbase ...
## Searching tarbase ...
head(example1@data)
## database mature_mirna_acc mature_mirna_id target_symbol target_entrez
## 1 mirecords MIMAT0000104 hsa-miR-107 BACE1 23621
## 2 mirecords MIMAT0000104 hsa-miR-107 SERBP1 26135
## 3 mirecords MIMAT0000104 hsa-miR-107 AGO1 26523
## 4 mirecords MIMAT0000104 hsa-miR-107 AGO2 27161
## 5 mirecords MIMAT0000104 hsa-miR-107 AGO3 192669
## 6 mirecords MIMAT0000104 hsa-miR-107 CCNE1 898
## target_ensembl experiment support_type pubmed_id type
## 1 ENSG00000186318 Luciferase activity assay 18234899 validated
## 2 ENSG00000142864 17637574 validated
## 3 ENSG00000092847 Western blot 20042474 validated
## 4 ENSG00000123908 Western blot 20042474 validated
## 5 ENSG00000126070 Western blot 20042474 validated
## 6 ENSG00000105173 19688090 validated
#rownames(limma.res.pval.FC)="hsa-mir-507"
example2 <- get_multimir(mirna = "hsa-mir-507" , summary = TRUE)
## Searching mirecords ...
## Searching mirtarbase ...
## Searching tarbase ...
head(example2@data)
## database mature_mirna_acc mature_mirna_id target_symbol target_entrez
## 1 mirtarbase MIMAT0002879 hsa-miR-507 CLOCK 9575
## 2 mirtarbase MIMAT0002879 hsa-miR-507 MYO10 4651
## 3 mirtarbase MIMAT0002879 hsa-miR-507 MYO10 4651
## 4 mirtarbase MIMAT0002879 hsa-miR-507 RBM47 54502
## 5 mirtarbase MIMAT0002879 hsa-miR-507 CAND1 55832
## 6 mirtarbase MIMAT0002879 hsa-miR-507 POGK 57645
## target_ensembl experiment support_type pubmed_id type
## 1 ENSG00000134852 HITS-CLIP Functional MTI (Weak) 23824327 validated
## 2 ENSG00000145555 PAR-CLIP Functional MTI (Weak) 22012620 validated
## 3 ENSG00000145555 PAR-CLIP Functional MTI (Weak) 21572407 validated
## 4 ENSG00000163694 HITS-CLIP Functional MTI (Weak) 23824327 validated
## 5 ENSG00000111530 PAR-CLIP Functional MTI (Weak) 24398324 validated
## 6 ENSG00000143157 PAR-CLIP Functional MTI (Weak) 20371350 validated
example3 <- get_multimir(mirna = "hsa-mir-1248", summary = TRUE)
## Searching mirecords ...
## Searching mirtarbase ...
## Searching tarbase ...
head(example3@data)
## database mature_mirna_acc mature_mirna_id target_symbol target_entrez
## 1 mirtarbase MIMAT0005900 hsa-miR-1248 LMNB1 4001
## 2 mirtarbase MIMAT0005900 hsa-miR-1248 CDKN1A 1026
## 3 mirtarbase MIMAT0005900 hsa-miR-1248 PRRG4 79056
## 4 mirtarbase MIMAT0005900 hsa-miR-1248 SP1 6667
## 5 mirtarbase MIMAT0005900 hsa-miR-1248 MYC 4609
## 6 mirtarbase MIMAT0005900 hsa-miR-1248 HMGB1 3146
## target_ensembl experiment support_type pubmed_id type
## 1 ENSG00000113368 HITS-CLIP Functional MTI (Weak) 23313552 validated
## 2 ENSG00000124762 PAR-CLIP Functional MTI (Weak) 21572407 validated
## 3 ENSG00000135378 HITS-CLIP Functional MTI (Weak) 23824327 validated
## 4 ENSG00000185591 HITS-CLIP Functional MTI (Weak) 23824327 validated
## 5 ENSG00000136997 TRAP Functional MTI (Weak) 24510096 validated
## 6 ENSG00000189403 HITS-CLIP Functional MTI (Weak) 23824327 validated
#Of all miRNA genes in the DGE list, only 3 were successfully queried with get_multimir to identify their mRNA targets.
#Of all identified targets of these 3, only CDKN1A (target of hsa-miR-1248) and SERBP1 (target of hsa-miR-107) appear distantly related (by gene symbol similarity)
#to the RNA-seq DGE genes CDKN2A and SERPINE1; they are different genes, so this is not a direct miRNA-target match. We will therefore plot these miRNA expression levels.
#As an alternative approach, we additionally obtain the targets from the Bioconductor package RmiR.Hs.miRNA, using its connection to TargetScan
#in the function miRNAGenes, which also uses biomaRt to retrieve the HGNC symbols.
#This is the function we will use later on to obtain the targets of each differentially expressed miRNA and for the miRNA vs. mRNA correlation analysis.
#miRNA database and biomaRt connections
dbListTables(RmiR.Hs.miRNA_dbconn())
## [1] "miranda" "mirbase" "mirtarget2" "pictar" "tarbase"
## [6] "targetscan"
#An example connecting to tarbase
#dbGetQuery(RmiR.Hs.miRNA_dbconn(),"SELECT * FROM tarbase WHERE mature_miRNA='hsa-miR-21'")
#ensembl=useMart("ensembl",dataset="hsapiens_gene_ensembl")
ensembl3 <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl") #using useEnsembl instead of useMart
miRNAGenes<-function(miRNA){
  #Function to obtain the TargetScan gene targets (as HGNC symbols) of a given mature miRNA
  #OLD VERSION (RSQLite::dbGetPreparedQuery() is deprecated in favour of DBI::dbGetQuery(params = ...)):
  #query.targetscan <- "SELECT * FROM targetscan WHERE mature_miRNA=?"
  #g.targetscan <- dbGetPreparedQuery(RmiR.Hs.miRNA_dbconn(), query.targetscan,bind.data=as.data.frame(miRNA))$gene_id
  #Read the first two columns of the targetscan table (a data frame with mature_miRNA and gene_id)
  targetscan <- dbReadTable(RmiR.Hs.miRNA_dbconn(), "targetscan")[,1:2]
  gens.sel.symbol<- ""
  #Entrez gene IDs targeted by the requested miRNA
  g.targetscan <- targetscan[targetscan$mature_miRNA ==miRNA,"gene_id" ]
  if (length(g.targetscan)>0) {
    #Translate the Entrez gene IDs into HGNC symbols with biomaRt
    gens.sel.symbol<-getBM(attributes="hgnc_symbol",filters="entrezgene_id",values=g.targetscan,mart=ensembl3)$hgnc_symbol
  }
  return(gens.sel.symbol)
}
#TESTED THIS FUNCTION ON SEVERAL SETS OF SIGNIFICANT DGE miRNA genes:
#miRNAs_test<-rownames(limma.res.pval.FC)
#miRNAs_test<-rownames(assay(mACC.mir3))
miRNAs_test<-c("hsa-miR-107" )
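#Note: this loop keeps only the targets of the last miRNA in miRNAs_test (here there is a single miRNA)
for (i in miRNAs_test){
  miRNA.genes_test<-miRNAGenes(i)
}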
miRNA.genes_test
## [1] "ABCF2" "GPC6" "ACTR2" "TSPAN5" "YAF2"
## [6] "CDK6" "CDK8" "SPRY3" "CORO2B" "ARIH2"
## [11] "VAV3" "CARM1" "AGPAT1" "ERLIN1" "EXOC5"
## [16] "ENTREP3" "NUP50" "WASF3" "TSPAN9" "MMP24"
## [21] "FERMT2" "CHD1" "ABHD2" "CHD2" "CLASRP"
## [26] "BAZ2A" "AKAP13" "PDCD10" "PRRT2" "CHRM1"
## [31] "SLC2A13" "SLITRK1" "SLC26A7" "MARCHF3" "ADCYAP1"
## [36] "CLCN5" "ADD2" "ARL8A" "SYT2" "UBR3"
## [41] "DCBLD2" "CSNK1G2" "FAM81A" "SYT6" "RC3H1"
## [46] "CTNND1" "FAM117B" "BTLA" "RNF38" "NEK10"
## [51] "CREBRF" "AMOT" "SLC35G1" "KANK4" "DLG4"
## [56] "RCAN1" "EBF1" "AGO4" "SCAMP5" "EFNB2"
## [61] "CELSR2" "EIF1AX" "EIF4B" "EIF5" "CC2D1B"
## [66] "HACD2" "EN2" "ENSA" "FAM219A" "ZNF449"
## [71] "AK2" "USF3" "ESR1" "ETV6" "LRRC55"
## [76] "RTKN2" "RBM24" "ATXN7L1" "ZNRF2" "FGF7"
## [81] "COBLL1" "RAB11FIP2" "CPEB3" "SLITRK3" "FOXJ3"
## [86] "DKK1" "IGSF9B" "ZHX3" "PEG10" "FSTL4"
## [91] "TNRC6B" "HIC2" "GPATCH8" "DCUN1D4" "GGA3"
## [96] "SEPTIN8" "FLOT2" "FAF2" "SIK2" "PLCB1"
## [101] "PPIP5K2" "ZC3H7B" "MGA" "KLHL18" "SATB2"
## [106] "RPGRIP1L" "WASHC4" "ICE1" "DICER1" "ZFPM2"
## [111] "TARDBP" "SLC35A3" "SUZ12" "SH3BP4" "BCL2L13"
## [116] "CNOT6L" "GABRB1" "SCML4" "GABRG2" "BCLAF3"
## [121] "SUN2" "TMEM184B" "RNF19A" "ADGRA2" "HIGD1A"
## [126] "SPATS2L" "UPF2" "APPL1" "RAI14" "POLDIP2"
## [131] "FBXO10" "AGO1" "LATS2" "AP3M1" "ABL2"
## [136] "FOXP1" "GK" "AFF4" "VPS4A" "DISC1"
## [141] "PCDH17" "TMEM121B" "GLUD1" "GNAI3" "GNS"
## [146] "AQP11" "ANKRD52" "DLL1" "FRYL" "ANK1"
## [151] "ANK3" "GRIA4" "HIPK2" "HAPSTR1" "TFCP2L1"
## [156] "PACSIN1" "BAZ2B" "HCFC1" "HTT" "HLF"
## [161] "HMGA1" "HNRNPA2B1" "APBA1" "AGFG1" "IGSF3"
## [166] "HTR4" "KRTAP11-1" "KY" "ZC3H12B" "SYT10"
## [171] "LANCL3" "IHH" "IRF2BP2" "CCDC178" "KCNC4"
## [176] "MIGA1" "RAB15" "KIF5A" "KIF5C" "KPNA1"
## [181] "KPNA3" "KPNA4" "TNPO1" "CEP85L" "C3P1"
## [186] "CD164L2" "PCARE" "ARHGAP5" "C12orf76" "SNX30"
## [191] "ZBTB34" "LRP1" "LRP2" "ARNT" "MAP4"
## [196] "MBNL1" "MECP2" "MEF2D" "GALNTL6" "PALM2AKAP2"
## [201] "MTF1" "MYBL1" "MYH9" "NEDD9" "NF1"
## [206] "NFIA" "NFIB" "ATP1B2" "NKTR" "NOTCH2"
## [211] "NOVA1" "NPAS2" "NTRK2" "FURIN" "CD207"
## [216] "ST8SIA3" "PHF20" "RASL12" "ZDHHC3" "WNT16"
## [221] "PDE3B" "PDE4D" "UBE2J1" "ANKFY1" "HACD3"
## [226] "SUFU" "CAB39" "CDK12" "SIX4" "GALNT7"
## [231] "CDK14" "PIK3R1" "PI4KB" "PITPNA" "PLAG1"
## [236] "BCL11A" "CHIC1" "LRP1B" "UBL3" "WNT4"
## [241] "CCNJ" "OTUD4" "CNNM2" "SNRK" "FNBP1L"
## [246] "ZCCHC2" "INO80D" "TMEM260" "UBE2R2" "RNF125"
## [251] "USP47" "BSDC1" "ARHGAP17" "LRRC8D" "ARMC1"
## [256] "RFWD3" "PPP2R5C" "ZNF654" "UBE2W" "FBXW7"
## [261] "PPP3R1" "PPP6C" "ETNK1" "CDV3" "KIF21A"
## [266] "PRKAB2" "IPO9" "DCP1A" "PRKCE" "FOXJ2"
## [271] "PAG1" "ASH1L" "MYNN" "PRKG1" "GPCPD1"
## [276] "PRMT8" "KCMF1" "FEM1C" "POGLUT1" "TULP4"
## [281] "GOPC" "PELI2" "ADAMTSL3" "PPP4R3B" "PTH"
## [286] "SRGAP1" "NUFIP2" "SEMA6A" "TWF1" "SIPA1L2"
## [291] "SLAIN2" "ADGRB3" "SPTBN4" "RAP2C" "PURB"
## [296] "NECTIN1" "SINHCAF" "RAN" "PLEKHA1" "TMEM35A"
## [301] "BCL2L2" "RGS4" "TGIF2" "BACH2" "RPS6KA3"
## [306] "CLIP1" "BDNF" "SALL1" "SCN1A" "SCN2A"
## [311] "SDCBP" "ANO3" "DUS1L" "BLMH" "ATL2"
## [316] "TENT4B" "GREM2" "ITSN1" "SH3GL2" "TNS3"
## [321] "ST3GAL2" "GNPNAT1" "SOWAHC" "WNK1" "SLC5A3"
## [326] "ZBTB8A" "SLC8A2" "SLC20A2" "SLN" "ZBTB10"
## [331] "SMARCE1" "SNCG" "SOS1" "SPTBN1" "ST13"
## [336] "VAMP1" "TDG" "TGFBR2" "TGFBR3" "TGM3"
## [341] "THRB" "THY1" "TLE4" "ACTG1" "TPD52"
## [346] "NR2C2" "UMOD" "VCL" "VCP" "NSD2"
## [351] "YWHAH" "ZNF711" "ZKSCAN1" "PCGF2" "TRIM26"
## [356] "CACNA1C" "CACNA2D1" "BTG2" "CRELD1" "ST8SIA4"
## [361] "DCAF10" "ATP13A3" "PLEKHF2" "LIN28A" "GSTCD"
## [366] "LONRF3" "SYNDIG1" "SVEP1" "FBXL18" "NAA15"
## [371] "CCDC6" "KMT2D" "KDM7A" "VOPP1" "NDEL1"
## [376] "CAMK2G" "RAB1B" "NRIP1" "CAPZA2" "AXIN2"
## [381] "EOMES" "FZD4" "RASSF5" "TMEM47" "HSDL1"
## [386] "USP42" "ZNRF3" "EVA1A" "SYDE2" "CHD6"
## [391] "MAF1" "CAMKK1" "PCGF5" "RAB11FIP4" "DYRK2"
## [396] "MAP3K21" "PHYHIPL" "LCOR" "CUL4A" "KRTAP4-4"
## [401] "MFSD14B" "OGT" "PHF5A" "TMEM25" "RSPO3"
## [406] "VCF1" "AJUBA" "CNTNAP1" "STRIP1" "ZC3H12C"
## [411] "SSH2" "CDC14A" "RUNX1T1" "IRS2" "VAMP8"
## [416] "VAMP4" "CDC23" "SNX3" "RNMT" "DCAF5"
## [421] "CDK5R1" "PER3" "DDX18" "HERC2" "BTRC"
## [426] "WNT3A" "NAV1" "NAV2" "CCNE1" "FCHSD1"
## [431] "TMEM250" "SH2D2A" "MTMR4" "SCAF11" "YTHDC1"
## [436] "DCLK1" "DSEL" "DLG5" "CENPBD1P" "ACVR2B"
## [441] "KLF4" "NREP" "COPS2" "UBE4A" "KIF3B"
## [446] "NMT2" "MED26" "PSMF1" "KIF23" "CLOCK"
## [451] "CREB5" "N4BP1" "PPP6R2" "TBKBP1" "SUSD6"
## [456] "KIAA0232" "RIMS3" "MAML1" "GIT2" "JAKMIP2"
## [461] "C2CD5" "TLK1" "ZBTB39" "NUAK1" "G3BP2"
## [466] "MFN2" "JOSD1" "HELZ" "AMMECR1" "CDC27"
## [471] "SLC12A6"
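#A minimal sketch, not part of the original analysis: when several miRNAs are queried, a named list keeps the targets of every miRNA
#instead of only the last one. toy_lookup is a hypothetical stand-in for miRNAGenes so that the example runs without database
#or biomaRt connections; its small lookup table only copies a few targets from the outputs above.

```r
# Hypothetical stand-in for miRNAGenes(): looks targets up in a small in-memory list
toy_targets <- list("hsa-miR-107" = c("CDK6", "DICER1"),
                    "hsa-miR-1248" = c("CDKN1A"))
toy_lookup <- function(miRNA) {
  hits <- toy_targets[[miRNA]]
  # Mirror miRNAGenes(): return an empty string when no target is found
  if (is.null(hits)) "" else hits
}

mirnas <- c("hsa-miR-107", "hsa-miR-1248", "hsa-miR-507")
# One list element per miRNA, named after the miRNA
targets_by_mirna <- lapply(setNames(mirnas, mirnas), toy_lookup)
lengths(targets_by_mirna)
```

#With the real function, toy_lookup would simply be replaced by miRNAGenes.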
#VISUALIZATION OF miRNA-Seq BLOCK DATA
#SUBSET LIST OF ANNOTATED miRNA GENES THAT ARE SIGNIFICANTLY DGE BETWEEN OLD AND YOUNG PATIENTS WITH CORRESPONDING GENE POSITION COORDINATES AND CHROMOSOMES:
countsFInfo_micro_sig<-countsFInfo_micro[countsFInfo_micro$ID %in% c("hsa-mir-153-2", "hsa-mir-153-1", "hsa-mir-541","hsa-mir-412","hsa-mir-3200", "hsa-mir-675","hsa-mir-1248", "hsa-mir-9-2","hsa-mir-9-1","hsa-mir-1229", "hsa-mir-511-1","hsa-mir-507","hsa-mir-107",
"hsa-mir-148b", "hsa-mir-542", "hsa-mir-98", "hsa-mir-887", "hsa-mir-9-3"),]
countsFInfo_micro_sig<-countsFInfo_micro_sig[,c("ID", "chromosome_name", "start_position", "end_position")]
countsFInfo_micro_sig
## ID chromosome_name start_position end_position
## 18 hsa-mir-107 10 89592747 89592827
## 24 hsa-mir-1229 CHR_HG30_PATCH 179799144 179799212
## 27 hsa-mir-1248 3 186786672 186786777
## 61 hsa-mir-148b 12 54337216 54337314
## 65 hsa-mir-153-1 2 219294111 219294200
## 66 hsa-mir-153-2 7 157574336 157574422
## 156 hsa-mir-3200 22 30731557 30731641
## 203 hsa-mir-412 14 101065447 101065537
## 238 hsa-mir-507 X 147230984 147231077
## 252 hsa-mir-541 14 101064495 101064578
## 253 hsa-mir-542 X 134541341 134541437
## 276 hsa-mir-675 CHR_HG28_PATCH 1998778 1998850
## 288 hsa-mir-887 5 15935182 15935260
## 290 hsa-mir-9-1 1 156420341 156420429
## 291 hsa-mir-9-2 5 88666853 88666939
## 292 hsa-mir-9-3 15 89368017 89368106
## 300 hsa-mir-98 X 53556223 53556341
#Based on NCBI, hsa-mir-1229 and hsa-mir-675 (reported above on patch scaffolds) are located on chromosomes 5 (5q35.3) and 11, respectively.
#Gene hsa-mir-511-1, which is missing from the table above, is situated on chromosome 10 at 17845107..17845193.
#Evidently, chromosomes X and 5 have the most (3 each) significantly DGE miRNA genes.
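#The per-chromosome tally can be checked with a short sketch. The chromosome assignments below are transcribed by hand from the table
#and the NCBI notes above (patch scaffolds replaced by chromosomes 5 and 11, hsa-mir-511-1 added on chromosome 10); the object names are illustrative only.

```r
# Chromosome of each significantly DGE miRNA gene (transcribed from the output above)
mirna_chr <- c(
  "hsa-mir-107" = "10", "hsa-mir-1229" = "5", "hsa-mir-1248" = "3",
  "hsa-mir-148b" = "12", "hsa-mir-153-1" = "2", "hsa-mir-153-2" = "7",
  "hsa-mir-3200" = "22", "hsa-mir-412" = "14", "hsa-mir-507" = "X",
  "hsa-mir-541" = "14", "hsa-mir-542" = "X", "hsa-mir-675" = "11",
  "hsa-mir-887" = "5", "hsa-mir-9-1" = "1", "hsa-mir-9-2" = "5",
  "hsa-mir-9-3" = "15", "hsa-mir-98" = "X", "hsa-mir-511-1" = "10"
)
# Number of DGE miRNA genes per chromosome, largest first
chr_counts <- sort(table(mirna_chr), decreasing = TRUE)
chr_counts
# Chromosomes sharing the maximum count
top_chr <- names(chr_counts)[chr_counts == max(chr_counts)]
top_chr
```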
miRNA_expr<-miniACC.assays.comp.age.cnvcalls.ranges[[4]]
#Already a GRanges Object (No need to unlist)
miRNA_expr.gr<-rowRanges(miRNA_expr)
#GVIZ VISUALIZATION OF miRNA-Seq Gene Expression for hsa-mir-107 gene on chromosome 10:
miRNA_expr.10<-miRNA_expr.gr[seqnames(miRNA_expr.gr)=='10',]
miRNA_expr.10<-keepSeqlevels(miRNA_expr.10,"10") #to remove undesired levels
exprs.10<-assays(miRNA_expr)$exprs[names(miRNA_expr.10),]
head(exprs.10)
## TCGA-OR-A5J9-01A-11R-A29W-13 TCGA-OR-A5JE-01A-11R-A29W-13
## hsa-mir-107 486 238
## hsa-mir-1287 14 4
## hsa-mir-1296 20 17
## hsa-mir-1307 17146 15253
## hsa-mir-146b 164 2008
## hsa-mir-202 16535 9335
## TCGA-OR-A5JF-01A-11R-A29W-13 TCGA-OR-A5JI-01A-11R-A29W-13
## hsa-mir-107 376 241
## hsa-mir-1287 33 37
## hsa-mir-1296 16 8
## hsa-mir-1307 5148 6484
## hsa-mir-146b 497 3543
## hsa-mir-202 14761 2724
## TCGA-OR-A5K0-01A-11R-A29W-13 TCGA-OR-A5KV-01A-11R-A29W-13
## hsa-mir-107 346 88
## hsa-mir-1287 38 12
## hsa-mir-1296 13 5
## hsa-mir-1307 10990 9169
## hsa-mir-146b 706 324
## hsa-mir-202 11136 13359
## TCGA-OR-A5L5-01A-11R-A29W-13 TCGA-OR-A5LC-01A-11R-A29W-13
## hsa-mir-107 77 287
## hsa-mir-1287 189 26
## hsa-mir-1296 7 24
## hsa-mir-1307 3501 17001
## hsa-mir-146b 1254 1124
## hsa-mir-202 2461 9924
## TCGA-OR-A5LE-01A-11R-A29W-13 TCGA-OR-A5LL-01A-11R-A29W-13
## hsa-mir-107 275 270
## hsa-mir-1287 46 48
## hsa-mir-1296 31 3
## hsa-mir-1307 28446 9177
## hsa-mir-146b 602 1367
## hsa-mir-202 7464 21922
chr <- "chr10"
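geno <- "hg19" #NOTE: the positions listed above (e.g. hsa-mir-107 at 10:89592747) appear to be GRCh38/hg38 coordinates, so the hg19 ideogram may be offset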
atrack <- AnnotationTrack(miRNA_expr.10, name = "miRNA-Seq for Gene hsa-mir-107")
gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(gen = geno, chromosome = chr)
#We set from and to in plotTracks to delimit the plotted region
dtrack <- DataTrack(data = t(exprs.10), start=start(miRNA_expr.10), end=end(miRNA_expr.10),chromosome = chr, genome = geno,name = "miRNA-Seq for Gene hsa-mir-107")
plotTracks(list(gtrack, atrack, itrack,dtrack),from=89590000 ,to=89600000,type="heatmap", col="blue") #heatmap of expression across samples
#CIRCOS VISUALIZATION:
options(stringsAsFactors = FALSE)
rr.df_micro<-as.data.frame(rowRanges(miRNA_expr))
rna_micro<-assays(miRNA_expr)$"exprs"
#Filtering
SD_micro <-apply(rna_micro,1,sd)
cbind(quantiles <-quantile(SD_micro, probs = seq(0, 1, 0.01)))
## [,1]
## 0% 0.00e+00
## 1% 0.00e+00
## 2% 3.16e-01
## 3% 5.08e-01
## 4% 8.47e-01
## 5% 1.25e+00
## 6% 1.39e+00
## 7% 1.51e+00
## 8% 1.76e+00
## 9% 2.02e+00
## 10% 2.32e+00
## 11% 2.60e+00
## 12% 2.83e+00
## 13% 2.97e+00
## 14% 3.16e+00
## 15% 3.48e+00
## 16% 4.03e+00
## 17% 4.38e+00
## 18% 4.70e+00
## 19% 5.20e+00
## 20% 5.68e+00
## 21% 6.11e+00
## 22% 7.16e+00
## 23% 8.04e+00
## 24% 8.88e+00
## 25% 9.89e+00
## 26% 1.05e+01
## 27% 1.12e+01
## 28% 1.20e+01
## 29% 1.31e+01
## 30% 1.45e+01
## 31% 1.53e+01
## 32% 1.77e+01
## 33% 1.87e+01
## 34% 2.26e+01
## 35% 2.43e+01
## 36% 2.79e+01
## 37% 3.14e+01
## 38% 3.71e+01
## 39% 4.18e+01
## 40% 4.45e+01
## 41% 4.75e+01
## 42% 4.92e+01
## 43% 5.12e+01
## 44% 5.55e+01
## 45% 6.57e+01
## 46% 7.40e+01
## 47% 8.24e+01
## 48% 9.19e+01
## 49% 9.88e+01
## 50% 1.08e+02
## 51% 1.20e+02
## 52% 1.30e+02
## 53% 1.40e+02
## 54% 1.46e+02
## 55% 1.61e+02
## 56% 1.72e+02
## 57% 1.92e+02
## 58% 2.13e+02
## 59% 2.26e+02
## 60% 2.38e+02
## 61% 2.52e+02
## 62% 2.69e+02
## 63% 2.94e+02
## 64% 3.22e+02
## 65% 3.59e+02
## 66% 3.93e+02
## 67% 4.00e+02
## 68% 4.58e+02
## 69% 5.48e+02
## 70% 5.79e+02
## 71% 6.22e+02
## 72% 6.77e+02
## 73% 7.40e+02
## 74% 8.16e+02
## 75% 9.34e+02
## 76% 1.00e+03
## 77% 1.04e+03
## 78% 1.23e+03
## 79% 1.47e+03
## 80% 1.75e+03
## 81% 1.91e+03
## 82% 2.21e+03
## 83% 2.36e+03
## 84% 2.87e+03
## 85% 3.42e+03
## 86% 3.86e+03
## 87% 4.96e+03
## 88% 5.57e+03
## 89% 6.00e+03
## 90% 7.06e+03
## 91% 7.98e+03
## 92% 8.93e+03
## 93% 1.20e+04
## 94% 1.80e+04
## 95% 2.32e+04
## 96% 3.10e+04
## 97% 5.54e+04
## 98% 1.07e+05
## 99% 3.09e+05
## 100% 7.59e+05
rna.f_micro<-rna_micro[SD_micro>quantiles["98%"],]
rr.df.f_micro<-rr.df_micro[rownames(rna.f_micro),]
T.rr_micro<-data.frame("chr"=rr.df.f_micro$seqnames,"Start"=as.integer(rr.df.f_micro$start),"End"=as.integer(rr.df.f_micro$end),rna.f_micro,row.names=NULL)
par(mar=c(2, 2, 2, 2));
plot(c(1,800), c(1,800), type="n", axes=F, xlab="", ylab="", main="");
circos(R=380, cir="hg19", W=4, type="chr", print.chr.lab=T, scale=T);
circos(R=320, cir="hg19", W=50, mapping=T.rr_micro, col.v=4, type="heatmap2",B=FALSE, cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue");
#Check the scale and consider transforming it
range(rna.f_micro)
## [1] 205 2753979
#Perform a log2 transformation with an offset of 1 (as log(0) is -Inf)
T.rr_micro<-data.frame("chr"=rr.df.f_micro$seqnames,"Start"=as.integer(rr.df.f_micro$start),"End"=as.integer(rr.df.f_micro$end),log2(rna.f_micro+1),row.names=NULL)
par(mar=c(2, 2, 2, 2));
plot(c(1,800), c(1,800), type="n", axes=F, xlab="", ylab="", main="");
circos(R=400, cir="hg19", W=4, type="chr", print.chr.lab=T, scale=T);
circos(R=340, cir="hg19", W=50, mapping=T.rr_micro, col.v=4, type="heatmap2",B=FALSE, cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue");
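#A minimal sketch of the filtering and transformation steps used above (keep the features whose standard deviation exceeds the 98% quantile,
#then apply log2 with an offset of 1), run on a simulated count matrix. The matrix toy_counts and all toy_* names are hypothetical and not part of the analysis.

```r
set.seed(1)
# Hypothetical count matrix: 200 features (rows) x 10 samples (columns)
toy_counts <- matrix(rnbinom(2000, mu = 100, size = 0.5), nrow = 200,
                     dimnames = list(paste0("feature", 1:200),
                                     paste0("sample", 1:10)))
# Per-feature standard deviation across samples
toy_sd <- apply(toy_counts, 1, sd)
# Keep only the features above the 98% quantile of the standard deviations
toy_cut <- quantile(toy_sd, probs = 0.98)
toy_top <- toy_counts[toy_sd > toy_cut, , drop = FALSE]
# The offset of 1 keeps zero counts finite after the log2 transformation
toy_log <- log2(toy_top + 1)
range(toy_log)
```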
GISTIC CNV DATA BLOCK ANALYSIS
#Preliminary analysis of individual extracted CNV Summarized Experiment:
#The following text summary is cited from the following url:
#https://bioconductor.org/packages/devel/bioc/vignettes/CNVRanger/inst/doc/CNVRanger.html
#Title: Summarization and quantitative trait analysis of CNV ranges
#Authors: Vinicius Henrique da Silva and Ludwig Geistlinger:
#Copy number variation (CNV) is a frequently observed deviation from the diploid state due to duplication or deletion of genomic regions.
#Definitions of the size range vary between sources: CNVs are described either as duplications or deletions of DNA segments larger than 1 kb,
#or as structural variations ranging in length between 50 bp and 1 Mbp.
#The single nucleotide polymorphism (SNP), on the other hand, is a single nucleotide change or a point mutation that is found in more than 1% of the population.
#Both CNVs and SNPs are immensely valuable in genetic screening studies and kinship analysis. There are five forms of CNVs.
#The first is called a deletion. A loss of a DNA segment can reduce the copy number of a gene or a group of genes.
#The second is called tandem duplication. Here, a copy of a chromosomal segment is inserted into an adjacent region.
#The third is called noncontiguous duplication. Here, a chromosomal segment duplicates and inserts into a distant chromosomal region or a different chromosome.
#The fourth form is called Multiallelic CNV. A segment of DNA duplicates several times and results in the formation of multiple alleles of a gene.
#The fifth form is called complex rearrangement.
#CNVs are widespread among humans: on average, 12 CNVs exist per individual in comparison to the reference genome.
#They have also been shown to play a role in diseases such as autism, breast cancer, obesity, Alzheimer’s disease and schizophrenia among other diseases.
#Germ line versus somatic CNV: germ line CNVs are relatively short (a few bp to a few Mbp) copy number changes that the individual inherits from one of the two
#parental gametes and thus are typically present in 100% of cells.
#Somatic CNVs (often called CNAs, where A stands for alteration or aberration) are copy number changes of any size and amount (from a few bases to whole chromosomes)
#that happen (and often carry on happening) in cancer cells. Cancer cells can be aneuploid (that is, largely triploid, tetraploid or even haploid)
#and can have high focal amplifications (it is not unusual for some regions to have 8-12 copies).
#Furthermore, because tumor samples are typically an admixture of normal and cancer cells, the tumor purity is unknown and variable.
#Different algorithms make different assumptions while handling somatic or germ line CNVs. Typically, a germ line CNV caller can assume:
#The genome is largely diploid.
#The sample is pure and homogeneous.
#Any gain or loss should show 50% more or 50% less coverage.
#For these reasons, the algorithms can focus more on associating p-values with each call; it is possible to estimate false positive and false negative rates.
#Somatic CNA callers cannot make any of the assumptions above, or if they do, they have limited scope.
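#The coverage expectations above can be illustrated with a small sketch. expected_ratio is a hypothetical helper (not from any package) that assumes
#diploid normal cells and ignores normalization to the average tumor ploidy; it only shows why impure tumor samples dilute the 50% steps expected in germ line data.

```r
# Expected tumour/normal coverage ratio for a region with tumour copy number `cn`
# in a sample with tumour fraction `purity`, assuming diploid normal cells
expected_ratio <- function(cn, purity = 1) {
  (purity * cn + (1 - purity) * 2) / 2
}

# Pure, homogeneous sample: a 1-copy loss or gain changes coverage by 50%
pure <- expected_ratio(c(1, 3))
# Sample that is a 50/50 mix of tumour and normal cells: the shift is halved
mixed <- expected_ratio(c(1, 3), purity = 0.5)
rbind(pure = pure, mixed = mixed, log2_mixed = log2(mixed))
```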
#CNVs can be experimentally detected based on comparative genomic hybridization, and computationally inferred from SNP-arrays or next-generation sequencing data.
#These technologies for CNV detection have in common that they report, for each sample under study, genomic regions that are duplicated or deleted with respect to a reference.
#Such regions are denoted as CNV calls and will be considered the starting point for analysis with the CNVRanger package.
#The CNVRanger package imports CNV calls from a simple file format into R, and stores them in dedicated Bioconductor data structures,
#and implements three frequently used approaches for summarizing CNV calls across a population:
#(i) the CNVRuler procedure that trims region margins based on regional density (Kim et al., 2012),
#(ii) the reciprocal overlap procedure that requires sufficient mutual overlap between calls (Conrad et al., 2010), and
#(iii) the GISTIC procedure that identifies recurrent CNV regions (Beroukhim et al., 2007).
#CNVRanger builds on regioneR for overlap analysis of CNVs with functional genomic regions, and implements RNA-seq expression Quantitative Trait Loci (eQTL) analysis
#for CNVs by interfacing with edgeR.
#CNVRanger reads CNV calls from a simple file format, providing at least chromosome, start position, end position, sample ID, and integer copy number for each call.
#The last column contains the integer copy number state for each call, encoded as
#0: homozygous deletion (2-copy loss)
#1: heterozygous deletion (1-copy loss)
#2: normal diploid state
#3: 1-copy gain
#4: amplification (>= 2-copy gain)
#For CNV detection software that uses a different encoding, it is necessary to convert to the above encoding. For example, the GISTIC2 procedure that was used to
#generate our SummarizedExperiment CNV block uses the following encoding, which can be converted by simply adding 2:
#-2: homozygous deletion (2-copy loss)
#-1: heterozygous deletion (1-copy loss)
#0: normal diploid state
#1: 1-copy gain
#2: amplification (>= 2-copy gain)
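#A minimal sketch of this conversion, using a hypothetical vector of GISTIC2 calls rather than the actual assay matrix;
#the same addition works element-wise on a whole matrix.

```r
# Hypothetical GISTIC2-style calls, one per copy number state
gistic_calls <- c(-2, -1, 0, 1, 2)
# Convert to the integer copy number states expected by CNVRanger
cn_state <- gistic_calls + 2
state_labels <- c("homozygous deletion", "heterozygous deletion",
                  "normal diploid state", "1-copy gain", "amplification")
# cn_state runs from 0 to 4, so shift by 1 to index the label vector
conversion <- data.frame(gistic = gistic_calls, state = cn_state,
                         label = state_labels[cn_state + 1])
conversion
```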
#In CNV analysis, it is often of interest to summarize individual calls across the population, (i.e. to define CNV regions), for subsequent association analysis with expression
#and phenotype data. In the simplest case, this just merges overlapping individual calls into summarized regions. However, this typically inflates CNV region size,
#and more appropriate approaches have been developed for this purpose. There is a need for quality control of CNV calls and appropriate accounting for sources of technical bias
#before applying these summarization functions (or in general downstream analysis with CNVRanger). For instance, protocols for read-depth CNV calling typically exclude calls
#overlapping defined repetitive and low-complexity regions including the UCSC list of segmental duplications (Trost et al., 2018; Zhou et al., 2018). We also note that CNVnator,
#a very popular read-depth CNV caller, implements the q0-filter to explicitly flag and, if desired, exclude calls that are likely to stem from such regions.
#If systematically over-represented in the input CNV calls, summarization procedures such as GISTIC will identify these regions as recurrent independent of whether there
#are biological or technical reasons for that. In particular in cancer, it is important to distinguish driver from passenger mutations, i.e. to distinguish meaningful events from random background aberrations.
#The GISTIC method identifies those regions of the genome that are aberrant more often than would be expected by chance, with greater weight given to high amplitude events
#(high-level copy-number gains or homozygous deletions) that are less likely to represent random aberrations.
#GISTIC is a tool to identify genes targeted by somatic copy number variation (CNV). The GISTIC algorithm defines CNV boundaries by a user-defined confidence level.
#Module Name: GISTIC2
#Description: Genomic Identification of Significant Targets in Cancer, version 2.0
#Authors: Gad Getz, Rameen Beroukhim, Craig Mermel, Steve Schumacher and Jen Dobson
#Date: 27 Mar 2017
#Release: 2.0.23
#Software interface: Command-line user interface
#Language: Matlab
#Operating system: Linux
#The GISTIC module identifies regions of the genome that are significantly amplified or deleted across a set of samples.
#Each aberration is assigned a G-score that considers the amplitude of the aberration as well as the frequency of its occurrence across samples.
#False Discovery Rate q-values are then calculated for the aberrant regions, and regions with q-values below a user-defined threshold are considered significant.
#For each significant region, a "peak region" is identified, which is the part of the aberrant region with greatest amplitude and frequency of alteration.
#In addition, a "wide peak" is determined using a leave-one-out algorithm to allow for errors in the boundaries in a single sample.
#The "wide peak" boundaries are more robust for identifying the most likely gene targets in the region. Each significantly aberrant region is also tested to
#determine whether it results primarily from broad events (longer than half a chromosome arm), focal events, or significant levels of both.
#The GISTIC module reports the genomic locations and calculated q-values for the aberrant regions. It identifies the samples that exhibit each significant
#amplification or deletion, and it lists genes found in each "wide peak" region.
#According to the MultiAssayExperiment quick-start vignette (https://www.bioconductor.org/packages/release/bioc/vignettes/MultiAssayExperiment/inst/doc/QuickStartMultiAssay.html),
#the assay matrix of our non-ranged SummarizedExperiment (gistict: SummarizedExperiment with 198 rows and 43 columns)
#obtained via the miniACC MultiAssayExperiment represents the GISTIC genomic copy number by gene. This apparently is a summary of filtered and statistically
#significant gene-based recurrent copy number lesions identified via the aforementioned GISTIC2 procedure.
#DIFFERENTIAL CNV gistic peaks ACROSS YOUNG AND OLD PATIENTS:
#Exploring the SummarizedExperiment extracted from the initial miniACC MultiAssayExperiment:
#TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages
cnv_gistic<-miniACC.assays.comp.age.cnvcalls.ranges[[3]]
#Alternatively:
mACC.CN3
## class: SummarizedExperiment
## dim: 198 10
## metadata(0):
## assays(1): ''
## rownames(198): DIRAS3 MAPK14 ... SQSTM1 KCNJ13
## rowData names(3): Gene.Symbol Locus.ID Cytoband
## colnames(10): TCGA-OR-A5J9-01A-11D-A29H-01 TCGA-OR-A5JE-01A-11D-A29H-01
## ... TCGA-OR-A5LE-01A-11D-A29H-01 TCGA-OR-A5LL-01A-11D-A29H-01
## colData names(0):
#Creating a phenotype dataframe:
phenoN3 <- data.frame(sample=colnames(assay(mACC.CN3)),patientID=colData(miniACC.assays.comp.age)$patientID, age.status=colData(miniACC.assays.comp.age)$years_to_birth)
rownames(phenoN3)<-phenoN3$sample
cond2<-phenoN3$age.status
gistic.peaks <- as.matrix(assay(mACC.CN3))
sum(is.na(gistic.peaks))
## [1] 0
#As part of the exploration, we plot data
boxplot(gistic.peaks)
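#Adding an offset of 2 maps the GISTIC values (-2 to 2) to 0-4; values of -2 therefore become log2(0) = -Inf, hence the warnings below
boxplot(log2(gistic.peaks+2))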
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 2 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 8 is not drawn
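#A short sketch of why log transformations produce -Inf and NaN on GISTIC-style values, using a hypothetical vector that spans the -2 to 2 encoding.
#It suggests that the shifted integer states themselves may be the safer quantity to plot; this is an illustration, not a change to the analysis.

```r
# Hypothetical GISTIC-style values covering the full -2..2 range
toy_gistic <- c(-2, -1, 0, 1, 2)
# With an offset of 2, the value -2 becomes log2(0) = -Inf
log2_shifted <- log2(toy_gistic + 2)
# Without an offset, negative values give NaN and zero gives -Inf
log10_raw <- suppressWarnings(log10(toy_gistic))
# Number of values a boxplot could not draw in each case
n_bad <- c(log2_shifted = sum(!is.finite(log2_shifted)),
           log10_raw = sum(!is.finite(log10_raw)))
n_bad
# The shifted states (0 to 4) are finite without any log transformation
shifted_states <- toy_gistic + 2
```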
#Hierarchical clustering
x_cnv<-gistic.peaks
#Euclidean distance
clust.cor.ward <- hclust(dist(t(x_cnv)),method="ward.D2")
plot(clust.cor.ward, main="hierarchical clustering", hang=-1,cex=0.8)
#The ward.D2 hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients
clust.cor.average <- hclust(dist(t(x_cnv)),method="average")
plot(clust.cor.average, main="hierarchical clustering", hang=-1,cex=0.8)
#The average-linkage hierarchical clustering does not appear to reflect the segregation of 5 old and 5 young patients
clust.cor.complete <- hclust(dist(t(x_cnv)),method="complete")
plot(clust.cor.complete, main="hierarchical clustering", hang=-1,cex=0.8)
#The complete-linkage hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients
#Correlation based distance
clust.cor.ward <- hclust(as.dist(1-cor(x_cnv)),method="ward.D2")
plot(clust.cor.ward, main="hierarchical clustering", hang=-1,cex=0.8)
#The ward.D2 hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients
clust.cor.average<- hclust(as.dist(1-cor(x_cnv)),method="average")
plot(clust.cor.average, main="hierarchical clustering", hang=-1,cex=0.8)
#The average-linkage hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients
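#The visual impression from a dendrogram can be complemented by cutting the tree into two clusters and cross-tabulating them against age status.
#The sketch below does this on simulated data with a deliberately clear group shift; toy_cnv and toy_age are hypothetical, and with the real data
#x_cnv and cond2 would take their place.

```r
set.seed(42)
# Hypothetical data: 20 genes x 10 samples, 5 "old" and 5 "young"
toy_age <- rep(c("old", "young"), each = 5)
toy_cnv <- matrix(rnorm(200, sd = 0.1), nrow = 20)
# Shift the young samples so that the two groups are clearly separated
toy_cnv[, toy_age == "young"] <- toy_cnv[, toy_age == "young"] + 1
colnames(toy_cnv) <- paste0(toy_age, 1:5)

# Ward clustering on Euclidean distances between samples
toy_hc <- hclust(dist(t(toy_cnv)), method = "ward.D2")
# Cut the tree into two clusters and compare them with age status
toy_groups <- cutree(toy_hc, k = 2)
agreement <- table(cluster = toy_groups, age = toy_age)
agreement
```

#A table with all samples of each age group in a single cluster indicates that the clustering reflects age status.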
sum1<-sum(is.na(gistic.peaks))
sum1
## [1] 0
#Density plot of gistic peaks (log10)
#gistic.peaks_log <- log(gistic.peaks,10)
#d <- density(gistic.peaks_log)
#plot(d,xlim=c(1,8),main="",ylim=c(0,.45),xlab="Raw CNV gistic peaks per gene after log10 transformation)", ylab="Density")
#for (s in 1:length(colnames(gistic.peaks_log))){
# gistic.peaks_log <- log(gistic.peaks[,s],10)
# d <- density(gistic.peaks_log)
# lines(d)
#}
#Error in density.default(gistic.peaks_log) : 'x' contains missing values
#Box plots of raw gistic peaks after log10 transformation (negative values give NaN and zeros give -Inf, hence the warnings below)
gistic.peaks_log <- log(gistic.peaks,10)
## Warning: NaNs produced
boxplot(gistic.peaks_log , main="", xlab="", ylab="Raw CNV gistic peaks per gene after log10 transformation",axes=FALSE)
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 6 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 7 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 8 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 9 is not drawn
axis(2)
axis(1,at=c(1:length(colnames(gistic.peaks_log))),labels=colnames(gistic.peaks_log),las=2,cex.axis=0.8)
#Plot Heatmap with condition age.status as labels
colnames(gistic.peaks)<-phenoN3$age.status
heatmap(gistic.peaks, col = topo.colors(50), margin=c(10,6))
#A patient is showing many recurrent gene lesions
#PCA
summary(pca.filt <- prcomp(t(x_cnv), scale=T ))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 7.630 6.488 5.472 4.642 4.3106 3.2729 2.9623 2.1138
## Proportion of Variance 0.294 0.213 0.151 0.109 0.0939 0.0541 0.0443 0.0226
## Cumulative Proportion 0.294 0.507 0.658 0.767 0.8605 0.9146 0.9589 0.9815
## PC9 PC10
## Standard deviation 1.9156 2.56e-15
## Proportion of Variance 0.0185 0.00e+00
## Cumulative Proportion 1.0000 1.00e+00
autoplot(pca.filt, data=phenoN3, colour="patientID", shape="age.status")
#There does not appear to be segregation by age status.
#Note that a total of 29.4% + 21.26% = 50.66% of the variance is accounted for by the
#first 2 principal components PC1 and PC2 (i.e. by their corresponding eigenvalues).
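#The proportion of variance of each component is its squared standard deviation divided by the sum of all squared standard deviations.
#The sketch below recomputes it from the standard deviations printed in the PCA summary above (typed in by hand, so the result is subject to rounding);
#with the fitted object the same calculation would use pca.filt$sdev.

```r
# Standard deviations of PC1..PC9 as printed above (PC10 is numerically zero)
pc_sdev <- c(7.630, 6.488, 5.472, 4.642, 4.3106, 3.2729, 2.9623, 2.1138, 1.9156)
# Variance (eigenvalue) of each component
pc_var <- pc_sdev^2
# Proportion of the total variance explained by each component
prop_var <- pc_var / sum(pc_var)
round(100 * prop_var[1:2], 2)
# Cumulative percentage explained by PC1 and PC2
pc12_percent <- 100 * sum(prop_var[1:2])
round(pc12_percent, 2)
```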
#GGBIO VISUALIZATION OF GISTIC COPY NUMBER VARIATION (CNV) RECURRENT REGIONS:
hg19sub
## GRanges object with 22 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 1 1-249250621 *
## [2] 2 1-243199373 *
## [3] 3 1-198022430 *
## [4] 4 1-191154276 *
## [5] 5 1-180915260 *
## ... ... ... ...
## [18] 18 1-78077248 *
## [19] 19 1-59128983 *
## [20] 20 1-63025520 *
## [21] 21 1-48129895 *
## [22] 22 1-51304566 *
## -------
## seqinfo: 22 sequences from hg19 genome
autoplot(hg19sub, layout = "circle", fill = "gray70")
#Use the same data to create the ideogram, label and scale tracks; the circle is laid out in the
# order in which the tracks are created, from inside to outside
#p <- ggbio() + circle(hg19sub, geom = "ideo", fill = "gray70") +
# circle(hg19sub, geom = "scale", size = 2) +
# circle(hg19sub, geom = "text", aes(label = seqnames),
# vjust = 0, size = 3)
#p
# Then we add a "rectangle" track to show the somatic CNV recurrent regions, which will look like vertical segments.
cnv_gistic<-miniACC.assays.comp.age.cnvcalls.ranges[[3]]
cnv_gr<-rowRanges(cnv_gistic)
p <- ggbio() + circle(cnv_gr, geom = "rect", color = "steelblue") +
circle(hg19sub, geom = "ideo", fill = "gray70") +
circle(hg19sub, geom = "scale", size = 2) +
circle(hg19sub, geom = "text", aes(label = seqnames),
vjust = 0, size = 3)
p
#Neither the DESeq / DESeq2 manual nor the edgeR manual covers copy number variation (CNV) analysis, so we do not use those tools for that purpose.
#CNV data do not follow the negative binomial distribution that DESeq2 expects for RNA-seq counts.
#Copy number takes discrete states, so something like a Hidden Markov Model (HMM) is more commonly employed, although copy number can be measured on a continuous scale too.
#The "fundamental limitation" of trying to detect CNV from RNA-seq is that a copy number event does not necessarily alter gene expression levels.
#A gene could easily be duplicated, for example, but, without the promoter sequence and/or transcription start site (TSS),
#it will not be expressed (or will be expressed only at negligible levels).
#edgeR and DESeq2 can be used for ChIP-seq, mostly for differential peak calling, which is different from CNV calling:
#there the data are counts whose distribution is in accordance with RNA-seq.
#The distributional assumptions of a DE tool do not suit CNV calling, which works on discrete copy number states.
#One needs to find the right tool and the right distribution for finding CNVs, and there are plenty of
#technologies to produce the data and tools to generate copy number profiles from those data. One important step is properly accounting for allelic frequencies
#while scanning through the genome and then using segmentation to find copy ratios. This cannot be done with DESeq2.
#Most DE tools assume that the biological variation has a continuous distribution (e.g. normal or gamma), but variation due to CNV would be discrete at
#integer multiples of the haploid coverage depth.
#Other options: window the genome into 10 kb bins; count the reads in every bin; normalize for sequencing depth so that the CNV values of all bins are on the same scale and can be validly compared.
#HMMcopy can be used to tackle GC bias; CNVkit uses normal samples to create a reference to which it compares each sample.
#CNVkit is a Python library and command-line software toolkit to infer and visualize copy number from high-throughput DNA sequencing data. It is designed for use with hybrid capture, including both
#whole-exome and custom target panels, and short-read sequencing platforms such as Illumina and Ion Torrent.
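#A minimal sketch of the binning idea mentioned above (window the genome, count reads per bin, normalize for depth).
#It uses simulated read positions on one hypothetical 100 kb chromosome; all object names and values are illustrative
#assumptions and are not part of the miniACC analysis. Dedicated tools such as HMMcopy or CNVkit additionally handle GC bias and segmentation.

```r
set.seed(1)
chrom_len <- 1e5   # hypothetical chromosome length
bin_size  <- 1e4   # 10 kb windows

# Simulated read start positions; the tumour has extra reads in bin 4 to mimic a gain
tumour_pos <- c(sample.int(chrom_len, 5000, replace = TRUE),
                sample(30001:40000, 1000, replace = TRUE))
normal_pos <- sample.int(chrom_len, 8000, replace = TRUE)

# Count reads per bin
breaks <- seq(0, chrom_len, by = bin_size)
count_bins <- function(pos) as.vector(table(cut(pos, breaks = breaks)))
tumour_counts <- count_bins(tumour_pos)
normal_counts <- count_bins(normal_pos)

# Depth normalization: express each bin as a fraction of the library size,
# then compare tumour with normal on the log2 scale
tumour_frac <- tumour_counts / sum(tumour_counts)
normal_frac <- normal_counts / sum(normal_counts)
log2_ratio  <- log2(tumour_frac / normal_frac)
round(log2_ratio, 2)
```

#In this simulation the bin carrying the extra reads stands out with the highest log2 ratio; a segmentation step would normally follow.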
#DIFFERENTIAL GISTIC CNV ANALYSIS
#As a preliminary step, we use a simple linear regression model to assess differences in GISTIC gene-based recurrent lesion copy number variation:
x_cnv_model<-x_cnv
colnames(x_cnv_model)<-cond2
x_cnv_model.t<-t(x_cnv_model)
x_cnv_model.t.df<-as.data.frame(x_cnv_model.t)
x_cnv_model.t.df$age.status<-as.factor(cond2)
#x_cnv_model.t.df
#Example of simple linear regression with single categorical variable factor age.status for first gene CNV:
summary(lm(x_cnv_model.t.df$DIRAS3 ~ x_cnv_model.t.df$age.status))
##
## Call:
## lm(formula = x_cnv_model.t.df$DIRAS3 ~ x_cnv_model.t.df$age.status)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8 -0.4 0.2 0.2 0.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.600 0.224 -2.68 0.028 *
## x_cnv_model.t.df$age.statusyoung 0.400 0.316 1.26 0.242
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5 on 8 degrees of freedom
## Multiple R-squared: 0.167, Adjusted R-squared: 0.0625
## F-statistic: 1.6 on 1 and 8 DF, p-value: 0.242
#With p-value=0.242 (n=10 samples), there is no evidence that DIRAS3 GISTIC CNV differs by age.status
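#A linear model with a single two-level factor is equivalent to a two-sample t-test with pooled variance.
#The sketch below illustrates this on hypothetical values (cnv_value and age_status are made-up stand-ins, not the DIRAS3 data):

```r
# Hypothetical GISTIC-like values for one gene in 5 old and 5 young patients
cnv_value  <- c(-1, 0, 0, -1, -2, 0, 1, 0, -1, 0)
age_status <- factor(rep(c("old", "young"), each = 5))

# p-value of the age effect from the linear model (row 2, column 4 of the coefficient table)
fit  <- summary(lm(cnv_value ~ age_status))
p_lm <- fit$coefficients[2, 4]

# Same comparison as a pooled-variance two-sample t-test
p_tt <- t.test(cnv_value ~ age_status, var.equal = TRUE)$p.value

all.equal(p_lm, p_tt)
```

#Both calls test the same hypothesis, so the two p-values agree.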
#Now run one regression per gene (198 genes in total):
my_lms <- lapply(1:((ncol(x_cnv_model.t.df))-1), function(x) lm(x_cnv_model.t.df[,x] ~ x_cnv_model.t.df$age.status))
# Extract just coefficients
sapply(my_lms, coef)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## (Intercept) -0.6 -0.4 -0.2 0.4 -0.2 -0.2 0.2 0.4 0.2
## x_cnv_model.t.df$age.statusyoung 0.4 0.4 -0.2 0.2 0.0 0.4 0.2 0.2 0.4
## [,10] [,11] [,12] [,13] [,14] [,15]
## (Intercept) 0.4 -0.4 2.00e-01 2.00e-01 -0.6 -0.2
## x_cnv_model.t.df$age.statusyoung -0.2 0.8 1.05e-16 -2.11e-16 0.4 -0.2
## [,16] [,17] [,18] [,19] [,20] [,21]
## (Intercept) 2.00e-01 2.00e-01 0.4 6.0e-01 0.2 0.4
## x_cnv_model.t.df$age.statusyoung -2.11e-16 -2.11e-16 -0.4 1.4e-16 0.4 -0.4
## [,22] [,23] [,24] [,25] [,26] [,27]
## (Intercept) 1.23e-16 7.02e-17 0.0 0.4 -0.6 -4.0e-01
## x_cnv_model.t.df$age.statusyoung 2.00e-01 -4.00e-01 -0.2 -0.4 0.2 1.4e-16
## [,28] [,29] [,30] [,31] [,32] [,33] [,34]
## (Intercept) -4.0e-01 0.0 -0.6 -0.2 -0.2 -0.4 0.4
## x_cnv_model.t.df$age.statusyoung 1.4e-16 -0.2 0.2 0.0 -0.4 0.6 -0.2
## [,35] [,36] [,37] [,38] [,39] [,40] [,41]
## (Intercept) -0.2 -0.6 -0.4 0.4 0.6 0.0 -0.4
## x_cnv_model.t.df$age.statusyoung 0.0 0.4 0.4 -0.4 -0.2 -0.2 0.4
## [,42] [,43] [,44] [,45] [,46] [,47]
## (Intercept) -0.6 0.2 -4.0e-01 -0.4 0.0 3.51e-17
## x_cnv_model.t.df$age.statusyoung 0.4 0.4 -1.4e-16 -0.2 -0.2 4.00e-01
## [,48] [,49] [,50] [,51] [,52] [,53] [,54]
## (Intercept) -0.2 6.0e-01 0.8 0.6 0.2 1.23e-16 0.8
## x_cnv_model.t.df$age.statusyoung -0.2 1.4e-16 -0.2 -0.4 0.4 2.00e-01 -0.2
## [,55] [,56] [,57] [,58] [,59] [,60] [,61]
## (Intercept) 0.4 -0.4 0.8 0.6 -0.2 0.4 6.00e-01
## x_cnv_model.t.df$age.statusyoung -0.2 0.4 -0.2 -0.2 -0.4 -0.4 7.02e-17
## [,62] [,63] [,64] [,65] [,66] [,67]
## (Intercept) -0.2 0.8 -4.0e-01 0.4 -4.0e-01 0.8
## x_cnv_model.t.df$age.statusyoung 0.0 -0.2 -1.4e-16 -0.4 1.4e-16 -0.4
## [,68] [,69] [,70] [,71] [,72] [,73] [,74]
## (Intercept) 0.2 0.4 0.4 6.0e-01 -0.4 -0.4 -0.4
## x_cnv_model.t.df$age.statusyoung 0.2 0.2 0.2 1.4e-16 -0.2 0.4 0.0
## [,75] [,76] [,77] [,78] [,79] [,80]
## (Intercept) 0.6 -0.4 6.0e-01 -0.6 3.51e-17 2.00e-01
## x_cnv_model.t.df$age.statusyoung -0.2 0.4 1.4e-16 0.4 -4.00e-01 -2.11e-16
## [,81] [,82] [,83] [,84] [,85] [,86] [,87]
## (Intercept) -0.2 6.0e-01 -0.2 0.4 0.4 0.6 6.0e-01
## x_cnv_model.t.df$age.statusyoung 0.0 1.4e-16 0.0 -0.4 -0.2 -0.2 1.4e-16
## [,88] [,89] [,90] [,91] [,92] [,93] [,94]
## (Intercept) 0.2 -0.6 0.4 -0.6 0.2 -4.0e-01 6.0e-01
## x_cnv_model.t.df$age.statusyoung 0.2 0.4 0.2 0.4 0.2 -1.4e-16 1.4e-16
## [,95] [,96] [,97] [,98] [,99] [,100] [,101]
## (Intercept) -0.2 -0.2 -0.6 -0.4 -0.6 0.2 0.6
## x_cnv_model.t.df$age.statusyoung 0.0 0.0 0.4 1.0 0.4 0.4 -0.4
## [,102] [,103] [,104] [,105] [,106] [,107]
## (Intercept) -0.4 6.0e-01 -0.4 0.8 0.8 0.0
## x_cnv_model.t.df$age.statusyoung -0.2 1.4e-16 0.0 -0.4 -0.2 -0.2
## [,108] [,109] [,110] [,111] [,112] [,113]
## (Intercept) -4.0e-01 0.4 0.4 -0.2 0.8 -0.6
## x_cnv_model.t.df$age.statusyoung -1.4e-16 0.2 -0.2 0.2 -0.4 0.4
## [,114] [,115] [,116] [,117] [,118] [,119]
## (Intercept) 0.8 6.0e-01 0.2 0.8 0.2 0.8
## x_cnv_model.t.df$age.statusyoung -0.2 1.4e-16 0.4 -0.2 0.2 -0.2
## [,120] [,121] [,122] [,123] [,124]
## (Intercept) 0.6 2.00e-01 -0.4 -2.00e-01 0.4
## x_cnv_model.t.df$age.statusyoung -0.4 -2.11e-16 0.4 1.76e-17 -0.4
## [,125] [,126] [,127] [,128] [,129] [,130]
## (Intercept) 0.8 0.4 -0.6 6.0e-01 -0.4 6.0e-01
## x_cnv_model.t.df$age.statusyoung -0.2 -0.2 0.4 1.4e-16 -0.2 1.4e-16
## [,131] [,132] [,133] [,134] [,135] [,136]
## (Intercept) -0.6 0.4 -0.4 3.51e-17 0.2 0.8
## x_cnv_model.t.df$age.statusyoung 0.4 0.2 0.0 2.00e-01 0.2 -0.2
## [,137] [,138] [,139] [,140] [,141] [,142]
## (Intercept) 0.4 0.4 0.8 -0.4 -0.6 4.00e-01
## x_cnv_model.t.df$age.statusyoung -0.4 0.2 -0.4 -0.2 0.4 7.02e-17
## [,143] [,144] [,145] [,146] [,147] [,148]
## (Intercept) 2.00e-01 0.4 0.4 -0.2 -0.4 0.8
## x_cnv_model.t.df$age.statusyoung 1.05e-16 0.2 0.2 0.0 -0.2 -0.2
## [,149] [,150] [,151] [,152] [,153]
## (Intercept) 0.4 2.00e-01 0.0 6.0e-01 0.4
## x_cnv_model.t.df$age.statusyoung 0.2 -2.11e-16 -0.2 1.4e-16 0.2
## [,154] [,155] [,156] [,157] [,158]
## (Intercept) -1.05e-16 -0.2 -1.58e-16 0.6 -0.2
## x_cnv_model.t.df$age.statusyoung 4.00e-01 0.0 2.00e-01 -0.4 -0.2
## [,159] [,160] [,161] [,162] [,163] [,164]
## (Intercept) -0.2 6.0e-01 0.8 0.2 0.4 0.2
## x_cnv_model.t.df$age.statusyoung 0.2 1.4e-16 -0.2 0.2 -0.4 0.2
## [,165] [,166] [,167] [,168] [,169] [,170]
## (Intercept) 0.2 0.2 -0.4 -0.6 -0.6 -0.4
## x_cnv_model.t.df$age.statusyoung 0.4 0.4 -0.2 0.4 0.4 0.4
## [,171] [,172] [,173] [,174] [,175] [,176]
## (Intercept) 0.4 0.6 7.02e-17 -0.4 0.4 0.4
## x_cnv_model.t.df$age.statusyoung 0.2 -0.4 -2.00e-01 -0.2 0.2 -0.4
## [,177] [,178] [,179] [,180] [,181] [,182]
## (Intercept) 0.4 -0.4 -8.78e-17 0.2 4.00e-01 -0.2
## x_cnv_model.t.df$age.statusyoung 0.2 0.2 2.00e-01 0.4 7.02e-17 0.0
## [,183] [,184] [,185] [,186] [,187]
## (Intercept) -0.4 0.4 -0.2 2.00e-01 3.51e-17
## x_cnv_model.t.df$age.statusyoung 0.4 -0.2 0.2 1.05e-16 2.00e-01
## [,188] [,189] [,190] [,191] [,192]
## (Intercept) 2.00e-01 6.0e-01 0.4 -1.05e-16 -0.2
## x_cnv_model.t.df$age.statusyoung -2.11e-16 1.4e-16 -0.2 4.00e-01 -0.2
## [,193] [,194] [,195] [,196] [,197] [,198]
## (Intercept) 0.4 0.4 1.23e-16 0 0.8 1.23e-16
## x_cnv_model.t.df$age.statusyoung 0.2 -0.4 2.00e-01 0 -0.2 2.00e-01
#For more detail, compute the full summary of each model:
summaries <- lapply(my_lms, summary)
#Coefficient estimates with their p-values (columns 1 and 4 of each coefficient table):
p_values<-lapply(summaries, function(x) x$coefficients[, c(1,4)])
#The lowest p-value for the age.status term, 0.0656, was obtained from list item index 98
gene_cnv<-colnames(x_cnv_model.t.df[98])
gene_cnv
## [1] "FOXO3"
#The gene with the lowest p-value for differential GISTIC CNV value with respect to young/old age.status is FOXO3.
#At p=0.0656, unadjusted for the 198 tests, it does not reach the conventional 0.05 threshold.
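#A minimal sketch of how the smallest p-value can be located programmatically and adjusted for multiple testing.
#It runs on simulated data; sim_cnv, sim_age and the other names are hypothetical stand-ins for x_cnv_model.t.df and age.status,
#and the Benjamini-Hochberg adjustment is one possible choice, not something used in the analysis above.

```r
set.seed(42)
# Simulated stand-in: 10 samples x 20 hypothetical genes
sim_cnv <- as.data.frame(matrix(rnorm(200), nrow = 10,
                                dimnames = list(NULL, paste0("gene", 1:20))))
sim_age <- factor(rep(c("old", "young"), each = 5))

# One regression per gene
sim_lms <- lapply(sim_cnv, function(y) lm(y ~ sim_age))

# Row 2, column 4 of each coefficient table holds Pr(>|t|) for the age effect
p_slope <- sapply(sim_lms, function(m) summary(m)$coefficients[2, 4])

# Gene with the smallest raw p-value, and Benjamini-Hochberg adjusted p-values
top_gene <- names(which.min(p_slope))
p_bh     <- p.adjust(p_slope, method = "BH")
head(sort(p_bh), 3)
```

#Applied to the real list, the same pattern would be sapply(summaries, function(x) x$coefficients[2, 4]) followed by which.min() and p.adjust().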
#r-squared values
sapply(summaries, function(x) c(r_sq = x$r.squared, adj_r_sq = x$adj.r.squared))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## r_sq 0.1667 1.11e-01 0.0476 0.0222 1.64e-32 0.04 0.0244 0.0222
## adj_r_sq 0.0625 -2.22e-16 -0.0714 -0.1000 -1.25e-01 -0.08 -0.0976 -0.1000
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## r_sq 0.0909 0.0476 0.267 2.27e-32 1.83e-32 0.1667 0.0476 1.83e-32
## adj_r_sq -0.0227 -0.0714 0.175 -1.25e-01 -1.25e-01 0.0625 -0.0714 -1.25e-01
## [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## r_sq 1.83e-32 0.111 5.65e-32 0.0909 0.111 0.0345 0.111 0.0345 0.111
## adj_r_sq -1.25e-01 0.000 -1.25e-01 -0.0227 0.000 -0.0862 0.000 -0.0862 0.000
## [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33]
## r_sq 0.04 2.57e-32 2.57e-32 0.0345 0.04 1.64e-32 0.0625 0.310
## adj_r_sq -0.08 -1.25e-01 -1.25e-01 -0.0862 -0.08 -1.25e-01 -0.0547 0.224
## [,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42]
## r_sq 0.0476 1.64e-32 0.1667 1.11e-01 0.111 0.0154 0.0345 0.250 0.1667
## adj_r_sq -0.0714 -1.25e-01 0.0625 -2.22e-16 0.000 -0.1077 -0.0862 0.156 0.0625
## [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
## r_sq 0.0909 2.05e-32 0.04 0.0345 0.0526 0.0476 7.70e-32 0.0476
## adj_r_sq -0.0227 -1.25e-01 -0.08 -0.0862 -0.0658 -0.0714 -1.25e-01 -0.0714
## [,51] [,52] [,53] [,54] [,55] [,56] [,57] [,58] [,59]
## r_sq 0.1667 0.0909 0.0345 0.0476 0.0476 0.250 0.0476 0.0154 0.0909
## adj_r_sq 0.0625 -0.0227 -0.0862 -0.0714 -0.0714 0.156 -0.0714 -0.1077 -0.0227
## [,60] [,61] [,62] [,63] [,64] [,65] [,66] [,67]
## r_sq 0.250 7.70e-32 1.64e-32 0.0476 1.05e-31 0.111 1.93e-32 0.1667
## adj_r_sq 0.156 -1.25e-01 -1.25e-01 -0.0714 -1.25e-01 0.000 -1.25e-01 0.0625
## [,68] [,69] [,70] [,71] [,72] [,73] [,74] [,75] [,76]
## r_sq 0.0244 0.0222 0.0222 7.70e-32 0.04 1.11e-01 0.000 0.0222 0.250
## adj_r_sq -0.0976 -0.1000 -0.1000 -1.25e-01 -0.08 -2.22e-16 -0.125 -0.1000 0.156
## [,77] [,78] [,79] [,80] [,81] [,82] [,83] [,84]
## r_sq 7.70e-32 0.1667 0.0714 1.83e-32 1.64e-32 3.92e-32 1.64e-32 0.250
## adj_r_sq -1.25e-01 0.0625 -0.0446 -1.25e-01 -1.25e-01 -1.25e-01 -1.25e-01 0.156
## [,85] [,86] [,87] [,88] [,89] [,90] [,91] [,92]
## r_sq 0.0476 0.0154 7.70e-32 0.0244 0.1667 0.0222 0.1667 0.0244
## adj_r_sq -0.0714 -0.1077 -1.25e-01 -0.0976 0.0625 -0.1000 0.0625 -0.0976
## [,93] [,94] [,95] [,96] [,97] [,98] [,99] [,100]
## r_sq 2.05e-32 3.92e-32 1.64e-32 1.64e-32 0.1667 0.362 0.1667 0.0909
## adj_r_sq -1.25e-01 -1.25e-01 -1.25e-01 -1.25e-01 0.0625 0.283 0.0625 -0.0227
## [,101] [,102] [,103] [,104] [,105] [,106] [,107] [,108]
## r_sq 0.1667 0.04 5.65e-32 0.000 0.1667 0.0476 0.0345 2.05e-32
## adj_r_sq 0.0625 -0.08 -1.25e-01 -0.125 0.0625 -0.0714 -0.0862 -1.25e-01
## [,109] [,110] [,111] [,112] [,113] [,114] [,115] [,116]
## r_sq 0.0222 0.0476 0.0345 0.1667 0.1667 0.0476 5.65e-32 0.0909
## adj_r_sq -0.1000 -0.0714 -0.0862 0.0625 0.0625 -0.0714 -1.25e-01 -0.0227
## [,117] [,118] [,119] [,120] [,121] [,122] [,123] [,124]
## r_sq 0.0476 0.0244 0.0476 0.1667 1.83e-32 0.250 7.22e-33 0.111
## adj_r_sq -0.0714 -0.0976 -0.0714 0.0625 -1.25e-01 0.156 -1.25e-01 0.000
## [,125] [,126] [,127] [,128] [,129] [,130] [,131] [,132]
## r_sq 0.0476 0.0164 0.1667 5.65e-32 0.04 5.65e-32 0.1667 0.0222
## adj_r_sq -0.0714 -0.1066 0.0625 -1.25e-01 -0.08 -1.25e-01 0.0625 -0.1000
## [,133] [,134] [,135] [,136] [,137] [,138] [,139] [,140] [,141]
## r_sq 0.000 0.0345 0.0244 0.0476 0.111 0.0222 0.1667 0.04 0.1667
## adj_r_sq -0.125 -0.0862 -0.0976 -0.0714 0.000 -0.1000 0.0625 -0.08 0.0625
## [,142] [,143] [,144] [,145] [,146] [,147] [,148] [,149]
## r_sq 8.40e-33 2.27e-32 0.0222 0.0118 1.64e-32 0.04 0.0476 0.0222
## adj_r_sq -1.25e-01 -1.25e-01 -0.1000 -0.1118 -1.25e-01 -0.08 -0.0714 -0.1000
## [,150] [,151] [,152] [,153] [,154] [,155] [,156] [,157]
## r_sq 1.83e-32 0.0345 5.65e-32 0.0222 0.0714 1.64e-32 0.0145 0.1667
## adj_r_sq -1.25e-01 -0.0862 -1.25e-01 -0.1000 -0.0446 -1.25e-01 -0.1087 0.0625
## [,158] [,159] [,160] [,161] [,162] [,163] [,164] [,165]
## r_sq 0.0476 0.0345 5.65e-32 0.0476 0.0244 0.111 0.0123 0.0909
## adj_r_sq -0.0714 -0.0862 -1.25e-01 -0.0714 -0.0976 0.000 -0.1111 -0.0227
## [,166] [,167] [,168] [,169] [,170] [,171] [,172] [,173] [,174]
## r_sq 0.0909 0.04 0.1667 0.1667 1.11e-01 0.0222 0.1667 0.0204 0.04
## adj_r_sq -0.0227 -0.08 0.0625 0.0625 -2.22e-16 -0.1000 0.0625 -0.1020 -0.08
## [,175] [,176] [,177] [,178] [,179] [,180] [,181] [,182]
## r_sq 0.0118 0.111 0.0118 0.0476 0.0204 0.0909 8.40e-33 1.64e-32
## adj_r_sq -0.1118 0.000 -0.1118 -0.0714 -0.1020 -0.0227 -1.25e-01 -1.25e-01
## [,183] [,184] [,185] [,186] [,187] [,188] [,189] [,190]
## r_sq 0.250 0.0476 0.0345 2.27e-32 0.0345 1.83e-32 5.65e-32 0.0476
## adj_r_sq 0.156 -0.0714 -0.0862 -1.25e-01 -0.0862 -1.25e-01 -1.25e-01 -0.0714
## [,191] [,192] [,193] [,194] [,195] [,196] [,197] [,198]
## r_sq 0.0714 0.0123 0.0222 0.250 0.0345 8.65e-32 0.0476 0.0345
## adj_r_sq -0.0446 -0.1111 -0.1000 0.156 -0.0862 -1.25e-01 -0.0714 -0.0862
#The models are stored in a list, where model 3 is in my_lms[[3]] and so on.
#plot(x_cnv_model.t.df[,x], pch = 16, col = "blue") #Plot the results
#abline(lmTemp) #Add a regression line
#summary(lmTemp)
#plot(lmTemp$residuals, pch = 16, col = "red")
#We now explore reading and processing GISTIC files and data via 3 alternative approaches (maftools, readGISTIC, drug_prediction)
#before later treating our GISTIC recurrent lesion matrix from miniACC as individual call matrix to be then used in CNVRanger function approach:
#TESTING MODIFIED DRUG PREDICTION FUNCTIONS TO PROCESS CNV GISTIC DATA
cnvs_drug<-as.data.frame(rowRanges(cnv_gistic))
cnv_df<-as.data.frame(assay(cnv_gistic))
#Make sure rownames() are samples, and colnames() are genes by transposing dataframe.
cnv_df.t<-t(cnv_df)
#Determine the number of samples we want the CNVs to be amplified in. The default is 10.
n=10
#Indicate whether or not we want to test cnv data. If TRUE, we will test cnv data. If FALSE, we will test mutation data.
cnv=TRUE
wd<-tempdir()
savedir<-setwd(wd)
#Apply map_cnv() function to produce the file map.RData, which stores the object 'theCnvQuantVecList_mat'
#map_cnv(Cnvs=cnvs_drug)
#Error in map_cnv(Cnvs = cnvs_drug) :
#ERROR: Check colnames() of cnv data. colnames() must include Sample, Chromosome, Start, End, and Segment_Mean
#> 403 genes were dropped because they have exons located on both strands
#> of the same reference sequence or on more than one reference sequence,
#> so cannot be represented by a single genomic range.
#> Use 'single.strand.genes.only=FALSE' to get all the genes in a
#> GRangesList object, or use suppressMessages() to suppress this message.
#load('map.RData') #This loads the object 'theCnvQuantVecList_mat', which was obtained using map_cnv()
#Make sure this data is a data frame and that colnames() are samples.
#data<-as.data.frame(t(theCnvQuantVecList_mat))
#samps<-colnames(data)
#colnames(data)<-substr(samps,1,nchar(samps)-12)
#Apply idwas() to test each cnv and each drug. The p-values and beta-values for each test will be exported
#idwas(drug_prediction=cnv_df.t , data=data, n=n, cnv=cnv)
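#THIS APPROACH YIELDED ERRORS DURING EXECUTION AND WAS ABANDONED: AS THE ERROR MESSAGE STATES, map_cnv() EXPECTS SEGMENT-LEVEL COLUMNS
#(Sample, Chromosome, Start, End, Segment_Mean), WHICH THE GENE RANGES STORED IN cnvs_drug DO NOT PROVIDE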
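#A minimal sketch of a column check that reproduces the reason for the error before calling such a function.
#ranges_df is a hypothetical stand-in for a data frame made from genomic ranges (column names seqnames, start, end, width, strand are assumed);
#required_cols is taken from the error message above.

```r
required_cols <- c("Sample", "Chromosome", "Start", "End", "Segment_Mean")

# Hypothetical data frame shaped like gene ranges converted to a data frame
ranges_df <- data.frame(seqnames = c("1", "17"),
                        start    = c(100L, 500L),
                        end      = c(200L, 900L),
                        width    = c(101L, 401L),
                        strand   = c("*", "*"))

# Hypothetical segment-level table with the expected layout
segments_df <- data.frame(Sample       = c("S1", "S1"),
                          Chromosome   = c("1", "17"),
                          Start        = c(100L, 500L),
                          End          = c(200L, 900L),
                          Segment_Mean = c(-0.3, 0.4))

# Columns that are required but absent
missing_in_ranges   <- setdiff(required_cols, colnames(ranges_df))
missing_in_segments <- setdiff(required_cols, colnames(segments_df))
missing_in_ranges
```

#The gene-range table lacks all five required columns, whereas the segment-level table passes the check.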
#TESTING READING TCGA ACC GISTIC DATA DIRECTLY, USING THE maftools FUNCTION readGistic() TO READ AND SUMMARIZE THE OUTPUT FILES GENERATED BY THE GISTIC PROGRAMME INTO A GISTIC S4 OBJECT:
#The readGistic function can take the files listed below individually, or a directory containing GISTIC results, from which it imports all the relevant files:
#readGistic(gisticAllLesionsFile = NULL,gisticAmpGenesFile = NULL,gisticDelGenesFile = NULL,gisticScoresFile = NULL,cnLevel = "all",isTCGA = FALSE,verbose = TRUE)
#Arguments
#gisticAllLesionsFile = All Lesions file generated by gistic. e.g; all_lesions.conf_XX.txt, where XX is the confidence level. Required. Default NULL.
#gisticAmpGenesFile=Amplification Genes file generated by gistic. e.g; amp_genes.conf_XX.txt, where XX is the confidence level. Default NULL.
#gisticDelGenesFile=Deletion Genes file generated by gistic. e.g; del_genes.conf_XX.txt, where XX is the confidence level. Default NULL.
#gisticScoresFile=scores.gistic file generated by gistic.
#cnLevel = level of CN changes to use. Can be 'all', 'deep' or 'shallow'. Default uses all i.e, genes with both 'shallow' or 'deep' CN changes
#isTCGA= Is the data from TCGA. Default FALSE.
#verbose= Default TRUE
#Evidently, we REQUIRE the first of the four files generated by GISTIC, i.e. all_lesions.conf_XX.txt.
#Based on Sakar Khan's YouTube video "Copy Number Variation Analysis using GISTIC - Tutorial" (https://www.youtube.com/watch?v=Ssw7Ryao1x4&t=30s)
#and on the GenePattern documentation at https://www.genepattern.org/modules/docs/GISTIC_2.0#gsc.tab=0,
#the format of this first file is as follows:
#All Lesions File (all_lesions.conf_XX.txt, where XX is the confidence level)
#The all lesions file summarizes the results from the GISTIC run. It contains data about the significant regions of amplification and deletion as well as which samples are amplified or deleted in each of these regions. The identified regions are listed down the first column, and the samples are listed across the first row, starting in column 10.
#Region Data
#Columns 1-9 present the data about the significant regions as follows:
#Unique Name: A name assigned to identify the region.
#Descriptor: The genomic descriptor of that region
#Wide Peak Limits: The “wide peak” boundaries most likely to contain the targeted genes. These are listed in genomic coordinates and marker (or probe) indices.
#Peak Limits: The boundaries of the region of maximal amplification or deletion.
#Region Limits: The boundaries of the entire significant region of amplification or deletion.
#q values: The q-value of the peak region.
#Residual q values after removing segments shared with higher peaks : The q-value of the peak region after removing (“peeling off”) amplifications or deletions that overlap other more significant peak regions in the same chromosome.
#Broad or Focal: Identifies whether the region reaches significance due primarily to broad events (called “broad”), focal events (called “focal”), or independently significant broad and focal events (called “both”).
#Amplitude Threshold: Key giving the meaning of values in the subsequent columns associated with each sample.
#Gene-level file (all_data_by_genes.txt): columns are Gene Symbol, Gene ID (Entrez), Cytoband, followed by one column per sample ID
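#A minimal sketch of how a table laid out like the all lesions file (9 region columns, then one column per sample) can be split.
#The table is built by hand with hypothetical peak names, sample names and values; it is not real GISTIC output.

```r
region_cols <- c("Unique Name", "Descriptor", "Wide Peak Limits", "Peak Limits",
                 "Region Limits", "q values",
                 "Residual q values after removing segments shared with higher peaks",
                 "Broad or Focal", "Amplitude Threshold")

# Hypothetical table: 2 regions, 9 region columns, 3 sample columns
lesions <- as.data.frame(matrix("", nrow = 2, ncol = 9), stringsAsFactors = FALSE)
colnames(lesions) <- region_cols
lesions[["Unique Name"]] <- c("Amplification Peak 1", "Deletion Peak 1")
lesions$sampleA <- c(1, 0)
lesions$sampleB <- c(0, 2)
lesions$sampleC <- c(2, 1)

# Columns 1-9 describe the regions; columns 10 onwards hold the per-sample values
region_data  <- lesions[, 1:9]
sample_calls <- as.matrix(lesions[, 10:ncol(lesions)])
rownames(sample_calls) <- lesions[["Unique Name"]]

# Number of regions altered (non-zero) in each sample
colSums(sample_calls > 0)
```

#Keeping the region annotation and the per-sample matrix separate makes it easier to convert the latter into an assay later on.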
#To obtain these files in the appropriate format, we examine our previously generated RangedSummarizedExperiment:
cnv_gistic_calls<-as.data.frame(assay(cnv_gistic))
#This is not an appropriate format for CNVRanger functions. We can either build an appropriate data frame and GRangesList object from the
#GISTIC recurrent lesion call matrix, OR try to download the data in an appropriate format as follows:
query <- GDCquery(project = "TCGA-ACC",data.category = "Copy Number Variation",data.type = "Copy Number Segment",
barcode = c("TCGA-OR-A5J9-01A-11D-A29H-01","TCGA-OR-A5JE-01A-11D-A29H-01","TCGA-OR-A5JF-01A-11D-A29H-01","TCGA-OR-A5JI-01A-11D-A29H-01",
"TCGA-OR-A5K0-01A-11D-A29H-01","TCGA-OR-A5KV-01A-11D-A29H-01","TCGA-OR-A5L5-01A-11D-A29H-01","TCGA-OR-A5LC-01A-11D-A29H-01","TCGA-OR-A5LE-01A-11D-A29H-01","TCGA-OR-A5LL-01A-11D-A29H-01"),
sample.type = c("Primary Tumor"))
## --------------------------------------
## o GDCquery: Searching in GDC database
## --------------------------------------
## Genome of reference: hg38
## --------------------------------------------
## oo Accessing GDC. This might take a while...
## --------------------------------------------
## ooo Project: TCGA-ACC
## --------------------
## oo Filtering results
## --------------------
## ooo By data.type
## ooo By barcode
## ooo By sample.type
## ----------------
## oo Checking data
## ----------------
## ooo Checking if there are duplicated cases
## ooo Checking if there are results for the query
## -------------------
## o Preparing output
## -------------------
GDCdownload(query)
## Downloading data for project TCGA-ACC
## GDCdownload will download 10 files. A total of 338.142 KB
## Downloading as: Thu_Jul_11_03_27_05_2024.tar.gz
data <- GDCprepare(query)
## Reading copy number variation files
data
## # A tibble: 4,895 × 7
## GDC_Aliquot Chromosome Start End Num_Probes Segment_Mean Sample
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 726707fb-2431-4598-a… 1 6.29e4 1.88e6 304 -0.337 TCGA-…
## 2 726707fb-2431-4598-a… 1 1.88e6 4.67e6 1366 -1.14 TCGA-…
## 3 726707fb-2431-4598-a… 1 4.67e6 4.67e6 2 -7.19 TCGA-…
## 4 726707fb-2431-4598-a… 1 4.68e6 5.71e6 867 -1.06 TCGA-…
## 5 726707fb-2431-4598-a… 1 5.71e6 5.72e6 10 -2.47 TCGA-…
## 6 726707fb-2431-4598-a… 1 5.72e6 9.26e6 2014 -1.02 TCGA-…
## 7 726707fb-2431-4598-a… 1 9.27e6 1.69e7 4033 -0.215 TCGA-…
## 8 726707fb-2431-4598-a… 1 1.69e7 1.70e7 66 -0.691 TCGA-…
## 9 726707fb-2431-4598-a… 1 1.70e7 2.53e7 5229 -0.212 TCGA-…
## 10 726707fb-2431-4598-a… 1 2.53e7 2.53e7 24 0.399 TCGA-…
## # ℹ 4,885 more rows
#Get the last run dates
lastRunDate <- getFirehoseRunningDates()[1]
lastAnalyseDate <- getFirehoseAnalyzeDates(1)
#Download GISTIC results
gistic <- getFirehoseData("ACC",gistic2_Date = getFirehoseRunningDates()[1]) #"20141017"
## RTCGAToolbox cache directory set to:
## C:\Users\User\AppData\Local/R/cache/R/RTCGAToolbox
## Using locally cached version of C:\Users\User\AppData\Local/R/cache/R/RTCGAToolbox/20160128-ACC-Clinical.txt
# get GISTIC results
gistic.allbygene <- gistic@GISTIC@AllByGene
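#gistic.thresholedbygene <- gistic@GISTIC@ThresholedByGene
#Error: no slot of name "ThresholedByGene" for this object of class "FirehoseGISTIC"
#(slotNames(gistic@GISTIC) lists the slot names that the installed RTCGAToolbox version actually provides)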
gistic.allbygene
## data frame with 0 columns and 0 rows
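#TO ULTIMATELY USE THE CNVRanger PACKAGE TO CONVERT INDIVIDUAL CALLS INTO GISTIC RECURRENT REGIONS LIKE THOSE WE OBTAINED VIA miniACC,
#WE NEED THE GISTIC FILES, BUT THE AllByGene SLOT RETURNED ABOVE IS EMPTY, SO THE TCGA TOOLS DID NOT PROVIDE THEM
#(ONE POSSIBLE CAUSE, AN ASSUMPTION TO CHECK AGAINST ?getFirehoseData, IS THAT GISTIC DATA MUST BE REQUESTED EXPLICITLY IN THE CALL).
#ALTERNATIVELY, WE TRY TO OBTAIN THE NECESSARY FILES VIA THE maftools PACKAGE: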
#With advances in cancer genomics, the Mutation Annotation Format (MAF) has become widely accepted for storing detected somatic variants.
#The Cancer Genome Atlas project has sequenced over 30 different cancers, with a sample size of over 200 for each cancer type.
#The resulting somatic variants are stored in MAF files. Here we read the example GISTIC results bundled with maftools (LAML data, not ACC):
gistic_res_folder <- system.file("extdata", package = "maftools")
laml.gistic = readGistic(gisticDir = gistic_res_folder, isTCGA = TRUE)
## -Processing Gistic files..
## --Processing amp_genes.conf_99.txt
## --Processing del_genes.conf_99.txt
## --Processing scores.gistic
## --Summarizing by samples
all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
scores.gistic <- system.file("extdata", "scores.gistic", package = "maftools")
laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, gisticScoresFile = scores.gistic, isTCGA = TRUE)
## -Processing Gistic files..
## --Processing amp_genes.conf_99.txt
## --Processing del_genes.conf_99.txt
## --Processing scores.gistic
## --Summarizing by samples
#gistic_maftools <- readGistic(gisticAllLesionsFile = "all_lesions.conf_99.txt",
# gisticAmpGenesFile = "amp_genes.conf_99.txt",
# gisticDelGenesFile = "del_genes.conf_99.txt",
# cnLevel = "all", gisticScoresFile = "scores.gistic")
#Error: File 'all_lesions.conf_99.txt' does not exist or is non-readable. getwd()=='C:/Users/User/Documents'
#There are three types of plots available to visualize gistic results:
#genome plot
gisticChromPlot(gistic = laml.gistic, markBands = "all")
#Co-gisticChromPlot
#Similarly, two GISTIC objects can be plotted side-by-side for cohort comparison. In this example, the same GISTIC object is used for demonstration.
coGisticChromPlot(gistic1 = laml.gistic, gistic2 = laml.gistic, g1Name = "AML-1", g2Name = "AML-2", type = 'Amp')
#oncoplot
#This is similar to an oncoplot, but for copy number data. One can again sort the matrix according to annotations, if any. The plot is meant to show the GISTIC results for LAML, sorted according to FAB classification, in which 7q deletions are virtually absent in the M4 subtype whereas they are widespread in other subtypes.
#gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = 'FAB_classification', sortByAnnotation = TRUE, top = 10)
#Error in h(simpleError(msg, call)) :
#error in evaluating the argument 'x' in selecting a method for function 'getClinicalData': object 'laml' not found
#The call fails because the MAF object 'laml' was never created in this session; it would first have to be read in with read.maf().
#Similar to MAF objects, there are methods available to access slots of GISTIC object - getSampleSummary, getGeneSummary and getCytoBandSummary.
#Summarized results can be written to output files using function write.GisticSummary.
#BECAUSE WE DID NOT OBTAIN THE NECESSARY ALL-LESIONS FILE FOR ACC, AND BECAUSE WE WERE UNSUCCESSFUL IN PROCESSING THE ADDITIONAL INDIVIDUAL CNV CALLS (RAGGED EXPERIMENT)
#FROM TCGA, WE WILL INSTEAD TREAT THE GISTIC EXPERIMENT PROVIDED VIA miniACC, WHICH QUANTIFIES GENE-BASED INTEGER STATES FOR RECURRENT CNV LESIONS,
#AS IF ITS GENE REGIONS WERE THE "INDIVIDUAL CNV CALLS". WE WILL CONVERT THESE INTO A GRangesList OBJECT, READ IT IN WITH CNVRanger,
#AND PROCESS IT AGAIN BY GISTIC2 TO YIELD THE STATISTICALLY SIGNIFICANT CHROMOSOME-WIDE RECURRENT REGIONS:
#CREATING AN INDIVIDUAL CALL-LIKE INPUT GRangesList OBJECT FOR CNVRanger FROM OUR TCGA GISTIC SUMMARIZED EXPERIMENT:
gensInfo_CNV<-getBM(attributes=c("hgnc_symbol","ensembl_gene_id","entrezgene_id","chromosome_name","start_position","end_position","description" ), filters=c("hgnc_symbol"), values=list(rownames(assay(mACC.CN3))), mart=ensembl102)
gensInfo_CNV$length <- gensInfo_CNV$end_position - gensInfo_CNV$start_position
range(gensInfo_CNV$length)
## [1] 2403 1216444
table(duplicated(gensInfo_CNV$hgnc_symbol))
##
## FALSE TRUE
## 197 15
gensInfo_CNV[duplicated(gensInfo_CNV$hgnc_symbol),]
## hgnc_symbol ensembl_gene_id entrezgene_id chromosome_name
## 2 ACACA ENSG00000278540 31 17
## 10 AKT3 ENSG00000117020 10000 1
## 51 CHGA ENSG00000100604 1113 14
## 53 CLDN7 ENSG00000181885 1366 17
## 63 EEF2K ENSG00000103319 29904 16
## 86 HSPA1A ENSG00000234475 3303 CHR_HSCHR6_MHC_DBB_CTG1
## 87 HSPA1A ENSG00000237724 3303 CHR_HSCHR6_MHC_COX_CTG1
## 88 HSPA1A ENSG00000215328 3303 CHR_HSCHR6_MHC_QBL_CTG1
## 89 HSPA1A ENSG00000204389 3303 6
## 113 MAPT ENSG00000276155 4137 CHR_HSCHR17_1_CTG5
## 114 MAPT ENSG00000186868 4137 17
## 122 MYH11 ENSG00000133392 4629 16
## 153 PTEN ENSG00000171862 5728 10
## 170 RPS6KA1 ENSG00000117676 6195 1
## 211 YWHAE ENSG00000108953 7531 17
## start_position end_position
## 2 37084992 37406836
## 10 243488233 243851079
## 51 92923150 92935285
## 53 7259903 7263983
## 63 22206278 22288738
## 86 31797650 31800132
## 87 31802834 31805316
## 88 31805699 31808181
## 89 31815543 31817946
## 113 46069784 46203150
## 114 45894551 46028334
## 122 15703135 15857028
## 153 87863625 87971930
## 170 26529761 26575030
## 211 1344275 1400222
## description
## 2 acetyl-CoA carboxylase alpha [Source:HGNC Symbol;Acc:HGNC:84]
## 10 AKT serine/threonine kinase 3 [Source:HGNC Symbol;Acc:HGNC:393]
## 51 chromogranin A [Source:HGNC Symbol;Acc:HGNC:1929]
## 53 claudin 7 [Source:HGNC Symbol;Acc:HGNC:2049]
## 63 eukaryotic elongation factor 2 kinase [Source:HGNC Symbol;Acc:HGNC:24615]
## 86 heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 87 heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 88 heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 89 heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 113 microtubule associated protein tau [Source:HGNC Symbol;Acc:HGNC:6893]
## 114 microtubule associated protein tau [Source:HGNC Symbol;Acc:HGNC:6893]
## 122 myosin heavy chain 11 [Source:HGNC Symbol;Acc:HGNC:7569]
## 153 phosphatase and tensin homolog [Source:HGNC Symbol;Acc:HGNC:9588]
## 170 ribosomal protein S6 kinase A1 [Source:HGNC Symbol;Acc:HGNC:10430]
## 211 tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein epsilon [Source:HGNC Symbol;Acc:HGNC:12851]
## length
## 2 321844
## 10 362846
## 51 12135
## 53 4080
## 63 82460
## 86 2482
## 87 2482
## 88 2482
## 89 2403
## 113 133366
## 114 133783
## 122 153893
## 153 108305
## 170 45269
## 211 55947
length(setdiff(rownames(assay(mACC.CN3)), gensInfo_CNV$hgnc_symbol))
## [1] 1
countsFDF_CNV <- data.frame(ID=rownames(assay(mACC.CN3)),assay(mACC.CN3))
countsFInfo_CNV <- right_join(countsFDF_CNV, gensInfo_CNV, by=c("ID"="hgnc_symbol"))
countsFInfo_CNV <- countsFInfo_CNV[!duplicated(countsFInfo_CNV$ID),] #After having checked duplications, just keep first result
countsFInfo_CNV_backup<-countsFInfo_CNV
colnames(countsFInfo_CNV_backup)[colnames(countsFInfo_CNV_backup) == 'chromosome_name'] <- 'chr'
colnames(countsFInfo_CNV_backup)[colnames(countsFInfo_CNV_backup) == 'start_position'] <- 'start'
colnames(countsFInfo_CNV_backup)[colnames(countsFInfo_CNV_backup) == 'end_position'] <- 'end'
#colnames(countsFInfo_CNV_backup)[colnames(countsFInfo_CNV_backup) == 'chromosome_name'] <- 'state'
#REPLACING THE FOLLOWING INCORRECTLY FORMATTED CHROMOSOME NAMES OBTAINED VIA BIOMART WITH THE CORRECTLY FORMATTED CHROMOSOME
#LOCATIONS FROM NCBI GENE DATABASE AND/OR UCSC GENOMIC BROWSER:
#14 RPS6KA1 CHR_HG2058_PATCH 26529761 26575030 = CHROMOSOME 1
#21 AKT3 CHR_HSCHR1_3_CTG32_1 243488233 243855434 = CHROMOSOME 1
#29 CLDN7 CHR_HG2087_PATCH 7259903 7263983 = CHROMOSOME 17
#36 PTEN CHR_HG2334_PATCH 87863440 87966341 = CHROMOSOME 10
#69 YWHAE CHR_HSCHR17_2_CTG2 1247054 1303157 = CHROMOSOME 17
#85 MAPT CHR_HSCHR17_2_CTG5 45906010 46039943 = CHROMOSOME 17
#102 ACACA CHR_HSCHR17_7_CTG4 37086456 37411442 = CHROMOSOME 17
#119 EEF2K CHR_HG926_PATCH 21992621 22075070 = CHROMOSOME 16
#147 MYH11 CHR_HSCHR16_1_CTG1 15788326 15942169 = CHROMOSOME 16
#179 HSPA1A CHR_HSCHR6_MHC_APD_CTG1 31882493 31884975 = CHROMOSOME 6
#208 CHGA CHR_HSCHR14_7_CTG1 92923080 92935293 = CHROMOSOME 14
#countsFInfo_CNV_backup %>% mutate(chr = ifelse(ID == "RPS6KA1", "1" , chr))
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "RPS6KA1", "chr"] <- "1"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "AKT3", "chr"] <- "1"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "CLDN7", "chr"] <- "17"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "PTEN", "chr"] <- "10"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "YWHAE", "chr"] <- "17"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "MAPT", "chr"] <- "17"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "ACACA", "chr"] <- "17"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "EEF2K", "chr"] <- "16"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "MYH11", "chr"] <- "16"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "HSPA1A", "chr"] <- "6"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "CHGA", "chr"] <- "14"
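#A possible alternative to correcting each gene by hand is to keep only the BioMart rows on standard chromosomes before removing duplicates.
#The sketch below uses a small hypothetical BioMart-like table (bm); standard_chr is an assumed list of primary chromosome names.
#Unlike relabelling the chr column alone, this also keeps the primary-assembly coordinates rather than those of the patch or alternate contig.

```r
# Hypothetical BioMart-like result: RPS6KA1 is returned on a patch contig and on chromosome 1
bm <- data.frame(hgnc_symbol     = c("RPS6KA1", "RPS6KA1", "PTEN", "FOXO3"),
                 chromosome_name = c("CHR_HG2058_PATCH", "1", "10", "6"),
                 stringsAsFactors = FALSE)

# Primary chromosome names (assumed naming: no "chr" prefix, as in the table above)
standard_chr <- c(as.character(1:22), "X", "Y", "MT")

# Keep rows on standard chromosomes first, then drop any remaining duplicate symbols
bm_std <- bm[bm$chromosome_name %in% standard_chr, ]
bm_std <- bm_std[!duplicated(bm_std$hgnc_symbol), ]
bm_std
```

#A gene annotated only on a patch contig would be dropped by this filter, so the result should be compared against the original gene list.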
#Setting the gene IDs as row names of the CNV summary data frame countsFInfo_CNV_backup
#(chromosome names corrected, duplicate gene ID rows removed)
rownames(countsFInfo_CNV_backup)<-countsFInfo_CNV_backup$ID
#PCA for CNV
countsFInfo_CNV_backup_PCAMFA<-countsFInfo_CNV_backup[,2:11]
#Transpose
countsFInfo_CNV_backup_PCAMFA.t<-t(countsFInfo_CNV_backup_PCAMFA)
# assign names; we add a cnv suffix to distinguish these variables from the genes in micexp or exp
colnames(countsFInfo_CNV_backup_PCAMFA.t)<-paste(countsFInfo_CNV_backup$ID,"cnv",sep=".")
#Construct data.frame to perform PCA
cnv4pca<-data.frame(cond2,countsFInfo_CNV_backup_PCAMFA.t)
res.pca.cnv<-PCA(cnv4pca,quali.sup=1)
res.pca.cnv
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 10 individuals, described by 198 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$quali.sup" "results for the supplementary categorical variables"
## 12 "$quali.sup$coord" "coord. for the supplementary categories"
## 13 "$quali.sup$v.test" "v-test of the supplementary categories"
## 14 "$call" "summary statistics"
## 15 "$call$centre" "mean of the variables"
## 16 "$call$ecart.type" "standard error of the variables"
## 17 "$call$row.w" "weights for the individuals"
## 18 "$call$col.w" "weights for the variables"
plot(res.pca.cnv,habillage=1)
#countsFInfo_CNV_backup
#We observe differences between the young and old patient samples (in dim 1 and dim2)
#FOR THE LATER CNVRANGER-BASED CNV/mRNA-Seq EXPRESSION CORRELATION ANALYSIS, MATCH THE GENE IDs OF BOTH THE FILTERED, LOG-TRANSFORMED,
#NORMALIZED mRNA COUNTS AND THE GISTIC CNV RECURRENT LESION PEAKS:
countsF_extracted<-as.matrix(countsFInfo_backup[,2:11])
rownames(countsF_extracted)<-countsFInfo_backup$ID
#Setting equal the sampleIDs:
#colnames(normalized_df.log)<-colnames(assay(mACC.CN3))
colnames(countsF_extracted)<-colnames(countsFInfo_CNV_backup[,2:11])
phenoN2<-phenoN
rownames(phenoN2)<-colnames(countsF_extracted)
phenoN2$sample<-colnames(countsF_extracted)
cond2<-phenoN2$age.status
#Checking NA
sum_na<-sum(is.na(countsF_extracted))
sum_na
## [1] 0
#Next, the mRNA-seq count matrix is normalized with DESeq2 and then log2-transformed:
#DESeq2 on COUNT MATRIX:
#Converting to integer storage mode to avoid an error in DESeqDataSetFromMatrix()
countsF_extracted_int<-countsF_extracted
object.size(countsF_extracted_int)
## 27496 bytes
mode(countsF_extracted_int) <- "integer"
object.size(countsF_extracted_int)
## 20256 bytes
cds <- DESeqDataSetFromMatrix(countData = countsF_extracted_int,colData = phenoN2,design = ~ age.status)
dds <- estimateSizeFactors(cds)
normalized_df <- counts(dds, normalized=TRUE)
normalized_df_log <- log2(normalized_df+1)
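#For reference, the size factors estimated above follow the median-of-ratios idea. The sketch below reproduces that idea in base R
#on a toy matrix; the counts are hypothetical, and this is an illustration under simplifying assumptions, not DESeq2's actual code.

```r
# Toy count matrix: 4 genes x 3 samples (hypothetical values).
# Sample s2 is sequenced twice as deep as s1, and s3 four times as deep.
counts <- matrix(c(10,  20,  40,
                   100, 200, 400,
                   5,   10,  20,
                   50,  100, 200),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Per-gene log geometric mean across samples.
log_geo_mean <- rowMeans(log(counts))
# Genes containing a zero count have a -Inf log-mean and are skipped.
use <- is.finite(log_geo_mean)

# Size factor of a sample = median ratio of its counts to the gene-wise geometric means.
size_factors <- apply(counts[use, , drop = FALSE], 2, function(k) {
  exp(median(log(k) - log_geo_mean[use]))
})

# Divide each column by its size factor, then log2-transform with a pseudocount of 1.
normalized <- sweep(counts, 2, size_factors, "/")
normalized_log <- log2(normalized + 1)
print(size_factors)
```

#After normalization the three toy samples become identical, which is the expected behaviour when samples differ only in sequencing depth.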
#FILTERED, NO-NA, NORMALIZED, LOG-TRANSFORMED MATRIX READY TO BE PROCESSED
#######################################################
#Subset to ensure same gene set is later co-analyzed
countsFInfo_CNV_backup_sub<-countsFInfo_CNV_backup[ rownames(countsFInfo_CNV_backup) %in% rownames(normalized_df_log), ]
#NOT TRANSPOSED df1_new<-as.data.frame(t(df1))
df1<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5J9.01A.11D.A29H.01" )]
df1$sampleID<-"TCGA.OR.A5J9.01A.11D.A29H.01"
colnames(df1)<-c("ID","chr", "start", "end", "state", "sampleID")
df1<-df1[,c("ID","chr", "start", "end", "sampleID","state")]
df2<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5JE.01A.11D.A29H.01" )]
df2$sampleID<-"TCGA.OR.A5JE.01A.11D.A29H.01"
colnames(df2)<-c("ID","chr", "start", "end", "state", "sampleID")
df2<-df2[,c("ID","chr", "start", "end", "sampleID","state")]
df3<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5JF.01A.11D.A29H.01" )]
df3$sampleID<-"TCGA.OR.A5JF.01A.11D.A29H.01"
colnames(df3)<-c("ID","chr", "start", "end", "state", "sampleID")
df3<-df3[,c("ID","chr", "start", "end", "sampleID","state")]
df4<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5JI.01A.11D.A29H.01" )]
df4$sampleID<-"TCGA.OR.A5JI.01A.11D.A29H.01"
colnames(df4)<-c("ID","chr", "start", "end", "state", "sampleID")
df4<-df4[,c("ID","chr", "start", "end", "sampleID","state")]
df5<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5K0.01A.11D.A29H.01" )]
df5$sampleID<-"TCGA.OR.A5K0.01A.11D.A29H.01"
colnames(df5)<-c("ID","chr", "start", "end", "state", "sampleID")
df5<-df5[,c("ID","chr", "start", "end", "sampleID","state")]
df6<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5KV.01A.11D.A29H.01" )]
df6$sampleID<-"TCGA.OR.A5KV.01A.11D.A29H.01"
colnames(df6)<-c("ID","chr", "start", "end", "state", "sampleID")
df6<-df6[,c("ID","chr", "start", "end", "sampleID","state")]
df7<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5L5.01A.11D.A29H.01" )]
df7$sampleID<-"TCGA.OR.A5L5.01A.11D.A29H.01"
colnames(df7)<-c("ID","chr", "start", "end", "state", "sampleID")
df7<-df7[,c("ID","chr", "start", "end", "sampleID","state")]
df8<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5LC.01A.11D.A29H.01" )]
df8$sampleID<-"TCGA.OR.A5LC.01A.11D.A29H.01"
colnames(df8)<-c("ID","chr", "start", "end", "state", "sampleID")
df8<-df8[,c("ID","chr", "start", "end", "sampleID","state")]
df9<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5LE.01A.11D.A29H.01" )]
df9$sampleID<-"TCGA.OR.A5LE.01A.11D.A29H.01"
colnames(df9)<-c("ID","chr", "start", "end", "state", "sampleID")
df9<-df9[,c("ID","chr", "start", "end", "sampleID","state")]
df10<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5LL.01A.11D.A29H.01" )]
df10$sampleID<-"TCGA.OR.A5LL.01A.11D.A29H.01"
colnames(df10)<-c("ID","chr", "start", "end", "state", "sampleID")
df10<-df10[,c("ID","chr", "start", "end", "sampleID","state")]
CNV_calls<-rbind(df1, df2, df3,df4, df5, df6, df7, df8, df9, df10)
CNV_calls_sort<-CNV_calls[order(CNV_calls$ID,decreasing = FALSE), ]
#Adding 2 to the GISTIC state (-2 to 2) to convert it to the integer copy-number state (0 to 4, with 2 = diploid) used by CNVRanger:
#CNV_calls_sort_add<-apply(CNV_calls_sort,1,function(x) x["state"]+2)
CNV_calls_sort$state<-CNV_calls_sort[,6]+2
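#The ten per-sample blocks df1 to df10 repeat the same four steps. A more compact way to express the wide-to-long reshaping and the
#GISTIC-to-copy-number shift is sketched below on a toy data frame; cnv_wide, sample_cols and the S1 to S3 sample names are
#hypothetical stand-ins for countsFInfo_CNV_backup_sub and the TCGA sample columns.

```r
# Toy stand-in: 2 genes x 3 samples of GISTIC values (hypothetical).
cnv_wide <- data.frame(
  ID    = c("GENE_A", "GENE_B"),
  chr   = c("1", "2"),
  start = c(100L, 500L),
  end   = c(200L, 900L),
  S1 = c(-1, 0), S2 = c(0, 1), S3 = c(2, -2),
  stringsAsFactors = FALSE
)
sample_cols <- c("S1", "S2", "S3")

# Build one long-format block per sample and stack them with rbind.
calls <- do.call(rbind, lapply(sample_cols, function(s) {
  data.frame(cnv_wide[, c("ID", "chr", "start", "end")],
             sampleID = s,
             state = cnv_wide[[s]] + 2,  # GISTIC -2..2 -> copy number 0..4
             stringsAsFactors = FALSE)
}))

# Sort by genomic position and reset the row names.
calls <- calls[order(calls$chr, calls$start), ]
rownames(calls) <- NULL
print(calls)
```

#Looping over the sample names avoids copy-and-paste errors in the hard-coded sample IDs and scales to any number of samples.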
nrow(CNV_calls_sort)
## [1] 1800
rownames(CNV_calls_sort)<-c(1:nrow(CNV_calls_sort))
CNV_calls_sort_sort2<-CNV_calls_sort[order(CNV_calls_sort$chr,CNV_calls_sort$start), ]
rownames(CNV_calls_sort_sort2)<-c(1:nrow(CNV_calls_sort_sort2))
CNV_calls_sort_sort2<-CNV_calls_sort_sort2[, c("chr","start", "end","sampleID", "state","ID")]
CNV_calls_sort_sort2$chr<-paste0("chr",CNV_calls_sort_sort2$chr )
length(unique(CNV_calls_sort_sort2[,"sampleID"]))
## [1] 10
#We build a genomic ranges object for the genes, annotated with both ENSEMBL gene IDs and HGNC gene symbols:
df_sel_gene<-countsFInfo_CNV_backup_sub[, c("chr","start","end", "ensembl_gene_id", "ID")]
df_sel_gene$strand="*"
df_sel_gene$score=1
df_sel_gene$chr<-paste0("chr", df_sel_gene$chr )
df_sel_gene<-df_sel_gene[, c("chr","start","end", "strand", "score", "ensembl_gene_id", "ID")]
gr_sel_gene<-makeGRangesFromDataFrame(df_sel_gene,keep.extra.columns=TRUE)
gr_sel_gene_hgnc<-gr_sel_gene
gr_sel_gene_ensembl<-gr_sel_gene
#split.field = "ensembl_gene_id"
#names.field = "ensembl_gene_id"
#ignore.strand=TRUE
#names(gr_sel_gene)<-mcols(gr_sel_gene)$ensembl_gene_id
#names(gr_sel_gene_hgnc)<-mcols(gr_sel_gene_hgnc)$ID
#Once read into an R data.frame, we group the calls by sample ID and convert them to a GRangesList.
#Each element of the list corresponds to a sample, and contains the genomic coordinates of the CNV calls for this sample
#(along with the copy number state in the state metadata column)
grl <- GenomicRanges::makeGRangesListFromDataFrame(CNV_calls_sort_sort2, split.field="sampleID", keep.extra.columns=TRUE)
grl <- GenomicRanges::sort(grl)
grl
## GRangesList object of length 10:
## $TCGA.OR.A5J9.01A.11D.A29H.01
## GRanges object with 180 ranges and 2 metadata columns:
## seqnames ranges strand | state ID
## <Rle> <IRanges> <Rle> | <numeric> <character>
## [1] chr1 7954291-7985505 * | 1 PARK7
## [2] chr1 8004404-8026309 * | 1 ERRFI1
## [3] chr1 11106535-11262551 * | 2 MTOR
## [4] chr1 15490832-15526534 * | 2 CASP9
## [5] chr1 25884181-25906991 * | 2 STMN1
## ... ... ... ... . ... ...
## [176] chrX 47561100-47571920 * | 3 ARAF
## [177] chrX 48574449-48581162 * | 3 RBM3
## [178] chrX 49187815-49200199 * | 3 SYP
## [179] chrX 123859724-123913976 * | 3 XIAP
## [180] chrX 154531391-154547572 * | 3 G6PD
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
##
## ...
## <9 more elements>
#Specifically developed for CNV calls inferred from SNP-chip data, the Bioconductor package CNVRanger allows carrying out a probe-level genome-wide association study (GWAS)
#with quantitative phenotypes. As previously described (da Silva et al., 2016), we construct CNV segments from probes representing common CN polymorphisms (allele frequency >1%), and carry out a GWAS as implemented in PLINK using a standard linear regression of phenotype on allele dosage.
#For CNV segments composed of multiple probes, the segment p-value is chosen from the probe p-values, using either the probe with minimum p-value or the probe with maximum CNV frequency.
#For compatibility with PLINK's fam file format, we create another phenotype data frame containing four columns representing patient traits from our MultiAssayExperiment:
phenoN4 <- data.frame(sample.id=colnames(assay(mACC.CN3)),fam=colData(miniACC.assays.comp.age)$race,sex=colData(miniACC.assays.comp.age)$gender, age.status=colData(miniACC.assays.comp.age)$years_to_birth)
#We combine the GISTIC CNV recurrent lesions peak "calls" with the phenotype information in a RaggedExperiment for coordinated representation and analysis:
re_gwas <- RaggedExperiment::RaggedExperiment(grl, colData=phenoN4)
re_gwas
## class: RaggedExperiment
## dim: 1800 10
## assays(2): state ID
## rownames: NULL
## colnames(10): TCGA.OR.A5J9.01A.11D.A29H.01 TCGA.OR.A5JE.01A.11D.A29H.01
## ... TCGA.OR.A5LE.01A.11D.A29H.01 TCGA.OR.A5LL.01A.11D.A29H.01
## colData names(4): sample.id fam sex age.status
#Given a RaggedExperiment storing CNV calls together with phenotype information, and optionally a map file for probe-level coordinates,
#the setupCnvGWAS function sets up all files needed for the GWAS analysis. The information required for analysis is stored in the resulting phen.info list:
#phen.info <- setupCnvGWAS("example", cnv.out.loc=re_gwas)
#phen.info
#Error in cnv.p.df[, 3] : subscript out of bounds
#In addition: There were 50 or more warnings (use warnings() to see the first 50)
#warnings()
#1: In .merge_two_Seqinfo_objects(x, y) :
#The 2 combined objects have no sequence levels in common. (Use suppressWarnings() to suppress this warning.)
#The last item of the list displays the working directory:
#all.paths <- phen.info$all.paths
#all.paths
#For the GWAS, chromosome names are assumed to be integer (i.e. 1, 2, 3, ...).
#We can then run the actual CNV-GWAS, here without correction for multiple testing which is done for demonstration only.
#In real analyses, multiple testing correction is recommended to avoid inflation of false positive findings.
#segs.pvalue.gr <- cnvGWAS(phen.info, chr.code.name=chr.code.name, method.m.test="none")
#segs.pvalue.gr
#BECAUSE OF THE ERROR ABOVE, WE FOREGO EXECUTION OF THE cnvGWAS() FUNCTION
#In CNV analysis, it is often of interest to summarize individual calls across the population (i.e. to define CNV regions)
#for subsequent association analysis with expression and phenotype data. In the simplest case, this just merges overlapping individual
#calls into summarized regions. Here we use a GISTIC-like procedure: by setting est.recur=TRUE, we deploy a GISTIC-like significance estimation
cnvrs <- populationRanges(grl, density=0.1, est.recur=TRUE)
## Excluding 976 copy-number neutral regions (CN state = 2, diploid)
cnvrs
## GRanges object with 180 ranges and 3 metadata columns:
## seqnames ranges strand | freq type pvalue
## <Rle> <IRanges> <Rle> | <numeric> <character> <numeric>
## [1] chr1 7954291-7985505 * | 5 loss 0.0
## [2] chr1 8004404-8026309 * | 5 loss 0.0
## [3] chr1 11106535-11262551 * | 4 loss 0.1
## [4] chr1 15490832-15526534 * | 3 loss 0.3
## [5] chr1 25884181-25906991 * | 4 loss 0.1
## ... ... ... ... . ... ... ...
## [176] chrX 47561100-47571920 * | 8 both 0.000000
## [177] chrX 48574449-48581162 * | 8 both 0.000000
## [178] chrX 49187815-49200199 * | 8 both 0.000000
## [179] chrX 123859724-123913976 * | 7 both 0.452381
## [180] chrX 154531391-154547572 * | 7 both 0.452381
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
#plotRecurrentRegions(regs, genome, chr, pthresh = 0.05)
#We keep only the recurrent CNV regions whose p-value is below the significance threshold of 0.05.
subset(cnvrs, pvalue < 0.05)
## GRanges object with 26 ranges and 3 metadata columns:
## seqnames ranges strand | freq type pvalue
## <Rle> <IRanges> <Rle> | <numeric> <character> <numeric>
## [1] chr1 7954291-7985505 * | 5 loss 0
## [2] chr1 8004404-8026309 * | 5 loss 0
## [3] chr1 110338506-110346681 * | 5 loss 0
## [4] chr1 114704469-114716771 * | 5 loss 0
## [5] chr5 52989340-53094779 * | 7 gain 0
## ... ... ... ... . ... ... ...
## [22] chr22 29603556-29698598 * | 5 loss 0
## [23] chr22 36281280-36387967 * | 5 loss 0
## [24] chrX 47561100-47571920 * | 8 both 0
## [25] chrX 48574449-48581162 * | 8 both 0
## [26] chrX 49187815-49200199 * | 8 both 0
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
#We illustrate the landscape of recurrent CNV regions using the function plotRecurrentRegions.
#For this, we provide the summarized CNV regions, a valid UCSC genome assembly, and a chromosome of interest.
#The gene coordinates used here match the hg38 assembly (also used for the overlap permutation test), so hg38 is passed as the genome.
plotRecurrentRegions(cnvrs, genome="hg38", chr="chr1")
plotRecurrentRegions(cnvrs, genome="hg38", chr="chr22")
plotRecurrentRegions(cnvrs, genome="hg38", chr="chr5")
plotRecurrentRegions(cnvrs, genome="hg38", chr="chrX")
sel.genes <- subset(gr_sel_gene, seqnames %in% paste0("chr", 1:2))
sel.genes_hgnc <- subset(gr_sel_gene_hgnc, seqnames %in% paste0("chr", 1:2))
sel.cnvrs <- subset(cnvrs, seqnames %in% paste0("chr", 1:2))
#The findOverlaps function from the GenomicRanges package is a general function for finding overlaps between two sets of genomic regions.
#Here, we use the function to find protein-coding genes overlapping the summarized CNV regions.
#Resulting overlaps are represented as a Hits object, from which overlapping query and subject regions can be obtained with dedicated accessor
#functions (named queryHits and subjectHits, respectively). Here, we use these functions to also annotate the CNV type (gain/loss) for genes overlapping with CNVs.
olaps <- GenomicRanges::findOverlaps(sel.genes, sel.cnvrs, ignore.strand=TRUE)
qh <- S4Vectors::queryHits(olaps)
sh <- S4Vectors::subjectHits(olaps)
cgenes <- sel.genes[qh]
cgenes$type <- sel.cnvrs$type[sh]
subset(cgenes, select = "type")
## GRanges object with 30 ranges and 1 metadata column:
## seqnames ranges strand | type
## <Rle> <IRanges> <Rle> | <character>
## DIRAS3 chr1 68045886-68051631 * | loss
## IGFBP2 chr2 216632828-216664436 * | gain
## RPS6KA1 chr1 26529761-26575030 * | loss
## FN1 chr2 215360440-215436073 * | gain
## BCL2L11 chr2 111119378-111168445 * | gain
## ... ... ... ... . ...
## ERRFI1 chr1 8004404-8026309 * | loss
## PARP1 chr1 226360210-226408154 * | both
## CASP9 chr1 15490832-15526534 * | loss
## MSH2 chr2 47403067-47663146 * | both
## CASP8 chr2 201233443-201287711 * | gain
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
olaps_hgnc <- GenomicRanges::findOverlaps(sel.genes_hgnc, sel.cnvrs, ignore.strand=TRUE)
qh_hgnc <- S4Vectors::queryHits(olaps_hgnc)
sh_hgnc <- S4Vectors::subjectHits(olaps_hgnc)
cgenes_hgnc <- sel.genes_hgnc[qh_hgnc]
cgenes_hgnc$type <- sel.cnvrs$type[sh_hgnc]
subset(cgenes_hgnc, select = "type")
## GRanges object with 30 ranges and 1 metadata column:
## seqnames ranges strand | type
## <Rle> <IRanges> <Rle> | <character>
## DIRAS3 chr1 68045886-68051631 * | loss
## IGFBP2 chr2 216632828-216664436 * | gain
## RPS6KA1 chr1 26529761-26575030 * | loss
## FN1 chr2 215360440-215436073 * | gain
## BCL2L11 chr2 111119378-111168445 * | gain
## ... ... ... ... . ...
## ERRFI1 chr1 8004404-8026309 * | loss
## PARP1 chr1 226360210-226408154 * | both
## CASP9 chr1 15490832-15526534 * | loss
## MSH2 chr2 47403067-47663146 * | both
## CASP8 chr2 201233443-201287711 * | gain
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
#We illustrate the original CNV calls on overlapping genomic features (here: protein-coding genes).
#For this purpose, an oncoPrint plot provides a useful summary in a rectangular fashion (genes in the rows, samples in the columns).
#Stacked barplots on the top and the right of the plot display the number of altered genes per sample and the number of altered samples per gene, respectively.
cnvOncoPrint(grl, cgenes)
cnvOncoPrint(grl, cgenes_hgnc)
#Overlap permutation test
#As a certain amount of overlap can be expected just by chance, an assessment of statistical significance is needed to decide whether the observed overlap
#is greater (enrichment) or less (depletion) than expected by chance. The regioneR package implements a general framework for testing overlaps of genomic regions
#based on permutation sampling. This allows repeated sampling of random regions from the genome, matching the size and chromosomal distribution of the region set under
#study (here: the CNV regions). By recomputing the overlap with the functional features in each permutation, statistical significance of the observed overlap
#can be assessed. We demonstrate in the following how this strategy can be used to assess the overlap between the detected CNV regions and protein-coding regions
#in the human genome. We expect to find a depletion as protein-coding regions are highly conserved and rarely subject to long-range structural variation such as CNV.
#Hence, is the overlap between CNVs and protein-coding genes less than expected by chance? To answer this question, we apply an overlap permutation test
#with 100 permutations (ntimes=100), while maintaining the chromosomal distribution of the CNV region set (per.chromosome=TRUE).
#Furthermore, we use the option count.once=TRUE to count an overlapping CNV region only once, even if it overlaps with 2 or more genes.
#We also allow random regions to be sampled from the entire genome (mask=NA), although in certain scenarios masking certain regions such
#as telomeres and centromeres is advisable. Also note that we use 100 permutations for demonstration only.
#To draw robust conclusions, a minimum of 1000 permutations should be carried out.
#Regarding masking: the masked hg38 genome sequences are the same as in BSgenome.Hsapiens.UCSC.hg38, except that each of them has the 4 following masks on top:
#(1) the mask of assembly gaps (AGAPS mask), (2) the mask of intra-contig ambiguities (AMB mask),
#(3) the mask of repeats from RepeatMasker (RM mask), and (4) the mask of repeats from Tandem Repeats Finder (TRF mask).
#Only the AGAPS and AMB masks are "active" by default. The sequences are stored in MaskedDNAString objects.
res <- regioneR::overlapPermTest(A=sel.cnvrs, B=sel.genes, ntimes=100, genome="hg38", mask=NA, per.chromosome=TRUE, count.once=TRUE)
res
## $numOverlaps
## P-value: 0.0099009900990099
## Z-score: 56.8712
## Number of iterations: 100
## Alternative: greater
## Evaluation of the original region set: 30
## Evaluation function: numOverlaps
## Randomization function: randomizeRegions
##
## attr(,"class")
## [1] "permTestResultsList"
summary(res[[1]]$permuted)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 0.3 1.0 2.0
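#The resulting permutation p-value (1/101, the smallest value attainable with 100 permutations), the "greater" alternative and the large positive Z-score
#indicate a significant enrichment rather than the expected depletion. This is plausible here because the CNV calls are gene-level GISTIC values, so each CNV region
#coincides with a gene by construction. On chromosomes 1 and 2, 30 CNV regions (sel.cnvrs object) overlap with at least one gene.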
plot(res)
#RE-attempting with entire gene set(not just chromosomes 1 and 2):
res2 <- regioneR::overlapPermTest(A=cnvrs, B=gr_sel_gene, ntimes=100, genome="hg38", mask=NA, per.chromosome=TRUE, count.once=TRUE)
res2
## $numOverlaps
## P-value: 0.0099009900990099
## Z-score: 104.4132
## Number of iterations: 100
## Alternative: greater
## Evaluation of the original region set: 180
## Evaluation function: numOverlaps
## Randomization function: randomizeRegions
##
## attr(,"class")
## [1] "permTestResultsList"
summary(res2[[1]]$permuted)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 2.00 2.42 3.00 8.00
plot(res2)
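#With the entire gene set, the observed overlap (180) lies even further from the permuted distribution (mean 2.42, maximum 8), giving a larger Z-score (104.4)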
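#The p-value of 0.0099 reported by both tests equals 1/101. A minimal base R sketch of how such an empirical permutation p-value
#and Z-score can be computed is given below; perm_pvalue is a hypothetical helper and the permuted counts are simulated,
#so this illustrates the principle rather than regioneR's internals.

```r
# Empirical p-value for a one-sided permutation test.
perm_pvalue <- function(observed, permuted, alternative = c("greater", "less")) {
  alternative <- match.arg(alternative)
  extreme <- if (alternative == "greater") {
    sum(permuted >= observed)   # permutations at least as large as observed
  } else {
    sum(permuted <= observed)   # permutations at least as small as observed
  }
  # Adding 1 to numerator and denominator counts the observed value itself,
  # so the p-value can never be exactly 0.
  (extreme + 1) / (length(permuted) + 1)
}

set.seed(1)
# Hypothetical permuted overlap counts: 100 draws, all far below the observed 30.
permuted <- rpois(100, lambda = 0.3)
observed <- 30

p <- perm_pvalue(observed, permuted, alternative = "greater")
z <- (observed - mean(permuted)) / sd(permuted)
print(c(p_value = p, z_score = z))
```

#With 100 permutations and no permuted value reaching the observed one, the p-value cannot fall below 1/101, which is why more
#permutations are needed to resolve smaller p-values.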
mRNA-Seq AND GISTIC CNV DATA BLOCK CORRELATION ANALYSIS
#Studies of expression quantitative trait loci (eQTLs) aim at the discovery of genetic variants that explain variation in gene expression levels
#(Nica and Dermitzakis, 2013). Mainly applied in the context of SNPs, the concept also naturally extends to the analysis of CNVs.
#The CNVRanger package implements association testing between CNV regions and RNA-seq read counts using edgeR,
#which applies generalized linear models based on the negative-binomial distribution while incorporating normalization factors for different library sizes.
#In the case of only one CN state deviating from 2n for a CNV region under investigation, this reduces to the classical 2-group comparison.
#For more than two states (e.g. 0n, 1n, 2n), edgeR’s ANOVA-like test is applied to test all deviating groups
#for significant expression differences relative to 2n.
#Assuming distinct modes of action, effects observed in the CNV-expression analysis are typically divided into (i) local effects (cis),
#where expression changes coincide with CNVs in the respective genes, and (ii) distal effects (trans), where CNVs supposedly affect trans-acting regulators
#such as transcription factors. However, due to power considerations and to avoid detection of spurious effects, stringent filtering of
#(i) not sufficiently expressed genes, and (ii) CNV regions with insufficient sample size in groups deviating from 2n, should be carried out
#when testing for distal effects. Local effects have a clear spatial indication and the number of genes located in or close to a CNV region of
#interest is typically small; testing for differential expression between CN states is thus generally better powered for local effects
#and less stringent filter criteria can be applied. In the following, we carry out CNV-expression association analysis by providing the
#CNV regions to test (cnvrs), the individual CNV calls (grl) to determine per-sample CN state in each CNV region, the RNA-seq read counts (rse),
#and the size of the genomic window around each CNV region (window). The window argument thereby determines which genes are considered for testing
#for each CNV region and is set here to 1 Mbp. Further, we use the filter.by.expr and min.samples arguments to exclude from the analysis
#(i) genes with very low read count across samples, and (ii) CNV regions with fewer than min.samples samples in a group deviating from 2n.
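#Note: cnvEQTL() models read counts with edgeR's negative-binomial GLM; rcounts holds log2-transformed, DESeq2-normalized values rather than raw counts,
#so the resulting fold changes and p-values should be interpreted with caution
rcounts<-normalized_df_log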
rcounts<-rcounts[rownames(rcounts) %in% rownames(df_sel_gene),]
#traceback()
#Build the row ranges for the expression matrix, indexed by HGNC gene symbol:
test<-gr_sel_gene_hgnc
#names(gr_sel_gene_ensembl)<-mcols(gr_sel_gene_ensembl)$ensembl_gene_id
#names(gr_sel_gene_hgnc)<-mcols(gr_sel_gene_hgnc)$ID
rranges <- GenomicRanges::granges(test)[rownames(rcounts)]
rse <- SummarizedExperiment(assays=list(rcounts=rcounts), rowRanges=rranges)
rse
## class: RangedSummarizedExperiment
## dim: 180 10
## metadata(0):
## assays(1): rcounts
## rownames(180): DIRAS3 MAPK14 ... IDH3A SQSTM1
## rowData names(0):
## colnames(10): TCGA.OR.A5J9.01A.11D.A29H.01 TCGA.OR.A5JE.01A.11D.A29H.01
## ... TCGA.OR.A5LE.01A.11D.A29H.01 TCGA.OR.A5LL.01A.11D.A29H.01
## colData names(0):
res <- cnvEQTL(cnvrs, grl, rse, min.samples=1,window = "1Mbp", verbose = TRUE)
## Restricting analysis to 10 intersecting samples
## Preprocessing RNA-seq data ...
## Summarizing per-sample CN state in each CNV region
## Excluding 45 cnvrs not satisfying min.samples threshold
## Analyzing 35 regions with >=1 gene in the given window
## 1 of 35
## 2 of 35
## 3 of 35
## 4 of 35
## 5 of 35
## 6 of 35
## 7 of 35
## 8 of 35
## 9 of 35
## 10 of 35
## 11 of 35
## 12 of 35
## 13 of 35
## 14 of 35
## 15 of 35
## 16 of 35
## 17 of 35
## 18 of 35
## 19 of 35
## 20 of 35
## 21 of 35
## 22 of 35
## 23 of 35
## 24 of 35
## 25 of 35
## 26 of 35
## 27 of 35
## 28 of 35
## 29 of 35
## 30 of 35
## 31 of 35
## 32 of 35
## 33 of 35
## 34 of 35
## 35 of 35
#The resulting GRangesList contains an entry for each CNV region tested, storing the genes tested in the genomic window around the CNV region,
#and (i) log2 fold change with respect to the 2n group, (ii) edgeR's DE p-value, and (iii) the (per default) Benjamini-Hochberg adjusted p-value.
#We can illustrate differential expression of genes in the neighborhood of a CNV region of interest using the function plotEQTL.
#The following regions can be plotted: 1,2,3,4,8,9,13,16,23,34,35
res[2]
## GRangesList object of length 1:
## $`chr1:8004404-8026309`
## GRanges object with 1 range and 4 metadata columns:
## seqnames ranges strand | logFC.CN1 logFC.CN3 PValue
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
## PARK7 chr1 7954291-7985505 * | -0.0526768 NA 0.237278
## AdjPValue
## <numeric>
## PARK7 0.37569
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
r <- GRanges(names(res)[2])
#The gene coordinates match the hg38 assembly, so hg38 is passed as the genome
plotEQTL(cnvr=r, genes=res[[2]], genome="hg38", cn="CN1")
###########################################CORRELATION OF RAW mRNA-Seq and GISTIC CNV DATA ACROSS ALL PATIENTS##########################
mRNA_expr<-miniACC.assays.comp.age.cnvcalls.ranges[[2]]
cnv_gistic<-miniACC.assays.comp.age.cnvcalls.ranges[[3]]
cnv_gistic_assay<-assay(cnv_gistic)
mRNA_expr_assay<-assay(mRNA_expr)
colnames(cnv_gistic_assay)==colnames(mRNA_expr_assay)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
colnames(cnv_gistic_assay)<-colnames(mRNA_expr_assay)
# Let's correlate first gene (first row):
plot(log2(mRNA_expr_assay[1,]),cnv_gistic_assay[1,])
cor.test(log2(mRNA_expr_assay[1,]),cnv_gistic_assay[1,], method="spearman")
## Warning in cor.test.default(log2(mRNA_expr_assay[1, ]), cnv_gistic_assay[1, :
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: log2(mRNA_expr_assay[1, ]) and cnv_gistic_assay[1, ]
## S = 59, p-value = 0.05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.64
#Spearman's rho (0.6396021) is lower than the correlation coefficient reported by FireBrowse (0.8455)
mRNA_expr_assay_m <- as.matrix(mRNA_expr_assay[,1:10])
cnv_gistic_assay_m <- as.matrix(cnv_gistic_assay[,1:10])
rownames(mRNA_expr_assay_m)<-rownames(mRNA_expr_assay)
rownames(cnv_gistic_assay_m)<-rownames(cnv_gistic_assay)
#Determining how the data are distributed for the first gene
hist(as.numeric(mRNA_expr_assay[1,1:10]))
hist(as.numeric(cnv_gistic_assay[1,1:10]))
#Determining how many other genes are strongly correlated between mRNA and CN assay omic data sets:
cors <- diag(cor(t(mRNA_expr_assay_m),t(cnv_gistic_assay_m),method="pearson"))
cors.sign <- cors[abs(cors)>0.67 & !is.na(cors)]
cors.sign #12 genes
## [1] -0.793 0.782 -0.694 0.719 0.707 0.733 0.674 0.694 0.714 0.705
## [11] 0.819 0.691
cor_set<-mRNA_expr_assay_m[c(42,68,69,98,112,114,116,138,145,158,184,190),]
rownames(cor_set)
## [1] "ATM" "ACVRL1" "TSC1" "GSK3A" "KEAP1" "XRCC1" "NFKB1" "NF2"
## [9] "MYH9" "YWHAB" "MSH2" "DIABLO"
#These correspond to the 12 genes that were strongly correlated between the mRNA and CN assay omic data sets:
#The RAW, UNFILTERED, NON-LOG TRANSFORMED, NON-NORMALIZED mRNA-seq expression levels and GISTIC CNV copy number of the 12 genes
#"ATM" "ACVRL1" "TSC1" "GSK3A" "KEAP1" "XRCC1" "NFKB1" "NF2" "MYH9" "YWHAB" "MSH2" "DIABLO" show |r| > 0.67 across ALL patients (Pearson, without multiple-testing correction)
#Plotting these 12 genes
op <- par(mfrow=c(2,2))
#cors.sign carries no gene names, so the loop iterates over the row names of cor_set instead
for (gene in rownames(cor_set)){
x <- as.numeric(mRNA_expr_assay_m[gene,])
y <- as.numeric(cnv_gistic_assay_m[gene,])
plot(x,y, main=gene, cex.main=0.8, xlab="mRNA expression", ylab="GISTIC CNV")
#Least-squares regression line of copy number on expression
fit <- lm(y ~ x)
abline(fit, col="chartreuse3")
}
#Restore the previous graphical parameters
par(op)
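#Taking diag(cor(t(A), t(B))) computes every gene-by-gene correlation and then discards all but the matched pairs, and the result
#printed above carries no gene names. A possible alternative that computes one correlation per gene and keeps the names is sketched
#below; expr and cnv are simulated, hypothetical matrices standing in for mRNA_expr_assay_m and cnv_gistic_assay_m.

```r
set.seed(42)
genes   <- paste0("gene", 1:6)
samples <- paste0("s", 1:10)

# Hypothetical expression and copy-number matrices with identical dimnames.
expr <- matrix(rnorm(60), nrow = 6, dimnames = list(genes, samples))
cnv  <- matrix(sample(-2:2, 60, replace = TRUE), nrow = 6,
               dimnames = list(genes, samples))
cnv["gene3", ] <- 0  # constant row: the correlation is undefined

# Rows and columns must be matched before correlating gene by gene.
stopifnot(identical(dimnames(expr), dimnames(cnv)))

# One Pearson correlation per gene; vapply keeps the gene names on the result.
row_cor <- vapply(genes, function(g) {
  if (sd(cnv[g, ]) == 0 || sd(expr[g, ]) == 0) return(NA_real_)
  cor(expr[g, ], cnv[g, ], method = "pearson")
}, numeric(1))

# Named subset of strongly correlated genes.
strong <- row_cor[!is.na(row_cor) & abs(row_cor) > 0.67]
print(round(row_cor, 3))
print(names(strong))
```

#Because the result is named, the strongly correlated genes can be looked up directly instead of through hard-coded row indices.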
#Because the miRNA data set has a different (larger) number of rows than the CN and mRNA data sets, which share identical gene rows, it will not be used for this correlation analysis
###############################CORRELATION BETWEEN RAW mRNA-Seq and GISTIC CNV WITHIN THE YOUNG AND OLD PATIENT GROUPS######
#INDEXES:
#YOUNG PATIENTS=1,2,4,6,9
#OLD PATIENTS=3,5,7,8,10
#YOUNG PATIENTS
#Determining how the data are distributed for the first gene in the young patients
hist(as.numeric(mRNA_expr_assay[1,c(1,2,4,6,9)]))
hist(as.numeric(cnv_gistic_assay[1,c(1,2,4,6,9)]))
#Determining how many other genes are strongly correlated between mRNA and CNV assay omic data sets across YOUNG PATIENTS:
cors.young <- diag(cor(t(mRNA_expr_assay_m[,c(1,2,4,6,9)]),t(cnv_gistic_assay_m[,c(1,2,4,6,9)]),method="pearson"))
## Warning in cor(t(mRNA_expr_assay_m[, c(1, 2, 4, 6, 9)]), t(cnv_gistic_assay_m[,
## : the standard deviation is zero
cors.young.sign <- cors.young[abs(cors.young)>0.67 & !is.na(cors.young)]
cors.young.sign
## [1] 0.698 0.970 0.725 0.978 0.828 0.727 -0.685 -0.902 0.705 -0.720
## [11] 0.939 -0.986 -0.795 0.859 -0.931 0.757 0.811 0.923 0.933 0.696
## [21] 0.985 -0.799 0.751 0.802 0.860 0.704 0.680 0.685 0.919 -0.983
## [31] 0.697 -0.773 0.807 0.905 0.695 0.680 0.959 0.909 0.805 0.816
## [41] 0.785 0.691 0.963 0.731 0.686 0.872 0.811 0.697 0.746 0.843
length(cors.young.sign) #50 genes
## [1] 50
#cor_set_young<-mRNA_expr_assay_m[c(),]
#rownames(cor_set_young)
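#For the RAW, UNFILTERED, NON-NORMALIZED, NON-LOG TRANSFORMED mRNA-seq expression levels and GISTIC CNV copy number, 50 genes show |r| > 0.67
#across the 5 selected young patients; with only n = 5 samples this threshold alone does not establish statistical significance
#(the two-sided 5% critical value of |r| for n = 5 is about 0.88)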
#OLD PATIENTS
#Determining how the data are distributed for the first gene in the old patients
hist(as.numeric(mRNA_expr_assay[1,c(3,5,7,8,10)]))
hist(as.numeric(cnv_gistic_assay[1,c(3,5,7,8,10)]))
#Determining how many other genes are strongly correlated between mRNA and CN assay omic data sets across OLD PATIENTS:
cors.old <- diag(cor(t(mRNA_expr_assay_m[,c(3,5,7,8,10)]),t(cnv_gistic_assay_m[,c(3,5,7,8,10)]),method="pearson"))
cors.old.sign <- cors.old[abs(cors.old)>0.67 & !is.na(cors.old)]
cors.old.sign
## [1] 0.803 -0.913 0.713 0.794 0.717 0.813 -0.823 -0.707 0.979 0.786
## [11] 0.687 0.819 -0.712 -0.795 0.817 -0.846 0.980 -0.767 0.888 -0.909
## [21] 0.828 -0.785 0.784 0.763 0.739 0.673 0.739 0.696 0.923 0.993
## [31] -0.680 0.769 0.800 0.772 0.670 0.686 0.802 0.833 0.813 0.816
## [41] -0.688 0.860 -0.914 0.891
length(cors.old.sign) # 44 genes
## [1] 44
#cor_set_old<-mRNA_expr_assay_m[c(),]
#rownames(cor_set_old)
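#For the RAW, UNFILTERED, NON-LOG TRANSFORMED, NON-NORMALIZED mRNA-seq expression levels and GISTIC CNV copy number, 44 genes show |r| > 0.67
#across the 5 selected OLD patients; with only n = 5 samples this threshold alone does not establish statistical significance
#(the two-sided 5% critical value of |r| for n = 5 is about 0.88)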
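#Whether the cut-off |r| > 0.67 corresponds to statistical significance depends on the number of samples. As a rough check, the
#sketch below derives the smallest |r| that reaches two-sided significance from the t-test for a Pearson correlation; critical_r
#is a hypothetical helper, and the calculation assumes approximately normal data and ignores multiple testing.

```r
# Smallest |r| reaching two-sided significance for n paired observations.
# Derived from the test statistic t = r * sqrt(df / (1 - r^2)) with df = n - 2.
critical_r <- function(n, alpha = 0.05) {
  df <- n - 2
  t_crit <- qt(1 - alpha / 2, df)
  t_crit / sqrt(t_crit^2 + df)
}

# n = 10: all selected patients; n = 5: one age group.
r_crit <- sapply(c(all_patients = 10, age_group = 5), critical_r)
print(round(r_crit, 3))
```

#Under these assumptions the cut-off of 0.67 lies above the critical value for 10 patients but below the one for a 5-patient group,
#so the within-group correlations are better described as strong than as significant.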
########################################MFA ON FILTERED, TPM-NORMALIZED, LOG-TRANSFORMED mRNA-SEQ AND GISTIC CNV DATA##############################
# GISTIC CNV
countsFInfo_CNV_backup_MFA<-countsFInfo_CNV_backup[,2:11]
# transpose
countsFInfo_CNV_backup_MFA.t<-t(countsFInfo_CNV_backup_MFA)
# assign names, we include a suffix to differentiate genes from expression
colnames(countsFInfo_CNV_backup_MFA.t)<-paste(countsFInfo_CNV_backup$ID,"cnv",sep=".")
#mRNA Expression
countsF_TPM_LOG_DF_MFA <- countsF_TPM_LOG_DF[,1:10]
colnames(countsF_TPM_LOG_DF_MFA) <- colnames(countsFInfo_CNV_backup_MFA) #To perform later MFA, we need to have the same names
# transpose
countsF_TPM_LOG_DF_MFA.t<-t(countsF_TPM_LOG_DF_MFA)
# assign names, we include a suffix to differentiate genes from cnv
colnames(countsF_TPM_LOG_DF_MFA.t)<-paste(countsF_TPM_LOG_DF$ID,"mRNAexp",sep=".")
#miRNA Expression
countsF_TPM_LOG_DF_micro_MFA<-countsF_TPM_LOG_DF_micro[,1:10]
colnames(countsF_TPM_LOG_DF_micro_MFA) <- colnames(countsFInfo_CNV_backup_MFA)
# transpose
countsF_TPM_LOG_DF_micro_MFA.t<-t(countsF_TPM_LOG_DF_micro_MFA)
# Assign names, we include a suffix to differentiate genes from cnv
colnames(countsF_TPM_LOG_DF_micro_MFA.t)<-paste(countsF_TPM_LOG_DF_micro$ID,"miRNAexp",sep=".")
mRNAexp.l<-nrow(countsF_TPM_LOG_DF_MFA )
cnv.l<-nrow(countsFInfo_CNV_backup_MFA )
dat4Facto<-data.frame(cond=as.factor(cond2),countsF_TPM_LOG_DF_MFA.t,countsFInfo_CNV_backup_MFA.t)
dim(dat4Facto)
## [1] 10 379
es = MFA(dat4Facto, group=c(1,mRNAexp.l,cnv.l), type=c("n",rep("c",2)), ncp=5, name.group=c("cond2","mRNAexp","cnv"),num.group.sup=c(1))
#top correlated variables with the first dimension (all of them come from the CNV block)
top10.1 <- sort(es$global.pca$var$cor[,"Dim.1"],decreasing=TRUE)[1:10]
top10.1
## CDKN1B.cnv ERBB3.cnv ACVRL1.cnv RICTOR.cnv ACACB.cnv GAPDH.cnv TUBA1B.cnv
## 0.964 0.964 0.964 0.964 0.964 0.964 0.964
## KRT5.cnv KRAS.cnv FOXM1.cnv
## 0.964 0.964 0.964
#top correlated variables with the second dimension (a mix of mRNA expression and CNV variables)
top10.2 <- sort(es$global.pca$var$cor[,"Dim.2"],decreasing=TRUE)[1:10]
top10.2
## PRKCA.mRNAexp SQSTM1.mRNAexp YWHAE.cnv PIK3R1.mRNAexp SRC.mRNAexp
## 0.894 0.886 0.878 0.851 0.811
## MAPK3.cnv PRRT2.cnv EEF2K.cnv MYH11.cnv AKT3.mRNAexp
## 0.796 0.796 0.796 0.796 0.793
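The reason MFA is used instead of a single PCA on the concatenated blocks is its block weighting: each standardized block is divided by the square root of the first eigenvalue of its own PCA, so that no block dominates the global analysis simply because it has more variables. The following base R sketch illustrates that weighting on simulated data. It is an illustration of the principle only, not FactoMineR's implementation (which differs in details such as the weights given to individuals), and all object names are hypothetical.

```r
set.seed(42)
n <- 10                                     # individuals (patients)
blockA <- matrix(rnorm(n * 6), nrow = n)    # simulated block with 6 variables
blockB <- matrix(rnorm(n * 4), nrow = n)    # simulated block with 4 variables

# Largest eigenvalue of the covariance matrix of an already centered matrix
top_eig <- function(z) svd(z)$d[1]^2 / (nrow(z) - 1)

# Standardize a block, then divide it by the square root of its first eigenvalue
weight_block <- function(x) {
  z <- scale(x)
  z / sqrt(top_eig(z))
}

wA <- weight_block(blockA)
wB <- weight_block(blockB)

# Global PCA on the concatenated weighted blocks
global <- cbind(wA, wB)
eig <- svd(global)$d^2 / (n - 1)
pct <- 100 * eig / sum(eig)

print(round(c(blockA = top_eig(wA), blockB = top_eig(wB)), 3))
print(round(eig[1:3], 3))
print(round(pct[1:3], 2))
```

After weighting, the first eigenvalue of every block equals 1, so the first global eigenvalue lies between 1 and the number of active blocks. A value close to the number of blocks indicates that the first dimension is shared by all of them.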
mRNA-Seq AND miRNA-Seq DATA BLOCK CORRELATION ANALYSIS
#Correlations between the significant miRNAs and their significant targets obtained by TargetScan will be evaluated.
#Correlations are measured and also some plots are generated on your hard disk. We will in general select those inversely
#correlated miRNAs and genes with a correlation Rho < -0.5 or 0.67
x_rna_backup<-x_rna
x_rna_backup<-as.matrix(x_rna_backup)
x_micro_backup<-x_micro
x_micro_backup<-as.matrix(x_micro_backup)
colnames(x_rna_backup)<-colnames(x_micro_backup)
mRNA.res2<-assay(mACC.exp3)
mRNA.res2<-as.data.frame(mRNA.res2)
mRNA.res2$Symbol<-rownames(mRNA.res2)
miRNA.res.hsa2<-assay(mACC.mir3)
miRNA.res.hsa2<-as.data.frame(miRNA.res.hsa2)
miRNA.res.hsa2$miRNA<-rownames(miRNA.res.hsa2)
#Correlations between the significant miRNAs and their significant targets obtained by TargetScan.
#Correlations are measured with Spearman's Rho; the code that writes dot plots with regression lines to disk is commented out.
#The p-values are adjusted with p.adjust() (default method: Holm), but to obtain more results the table is filtered
#only on the raw p-value < 0.05; the inverse-correlation criterion (Rho < -0.5) is not enforced in the code below.
resultsComb<-"./ResultsComb"
if(!dir.exists(resultsComb)) dir.create(resultsComb)
#cols<-as.vector(car::recode(pData(my.targets)$Cond,"'chord' ='green';'notochord' ='blue';"))
#pchs<-as.vector(car::recode(pData(my.targets)$Cond, "'chord' =16;'notochord' =17;"))
miRNAs<-miRNA.res.hsa2$miRNA
mRNAs<-mRNA.res2$Symbol
miRNACorrel<-function(res.miRNA,res.mRNA,data.miRNA,data.mRNA,resultsDir){
#Function that looks up the targets of each miRNA in a list and
#computes the Spearman correlation between each miRNA and its targets
#(the pdf with regression lines and the summary csv are currently commented out)
#needs the function miRNAGenes defined previously
miRNAs<-res.miRNA$miRNA
mRNAs<-res.mRNA$Symbol
for (i in miRNAs){
miRNA.genes<-miRNAGenes(i)
miRNA.genes.deg<-intersect(miRNA.genes,mRNAs)
#correlations
lng<-length(miRNA.genes.deg)
if (lng>0){
cor.rho<-array(NA,lng)
cor.pval<-array(NA,lng)
miRNA.id<-rownames(res.miRNA[res.miRNA$miRNA==i,])
y=as.vector(data.miRNA[miRNA.id,])
#pdf(file.path(resultsComb, paste0(miRNA.id,".corr.mRNA.miRNA.pdf")))
for (j in 1:lng){
mRNA<-miRNA.genes.deg[j]
mRNA.id<-rownames(res.mRNA[!is.na(res.mRNA$Symbol) & res.mRNA$Symbol==mRNA,])[1]
x=as.vector(data.mRNA[mRNA.id,])
cor<-cor.test(x,y, method = "spearman",exact=FALSE)
cor.pval[j]<-cor$p.value
cor.rho[j]<-cor$estimate
#we will plot just those combinations having a p.value<0.05 and a regression coef above 0.5 (positive or negative)
#if (cor$p.value < 0.05 & cor$estimate<(-0.5)){
#plot(x, y, main=mRNA,
#xlab="log2RMA expression",
#ylab="log2miRMA expression",
#type="p",
#xlim=c(0,16),
#ylim=c(0,16),
#col=cols,
#pch=pchs,
#cex=0.8)
#fit <- lm(y ~ x)
#abline(fit, col="chartreuse3",xlim=c(0,16))
#}
}
#dev.off() #close pdf file
cor.table<-data.frame("miRNA ID"=rep(miRNA.id,lng),
"miRNA"=rep(i,lng),
miRNA.genes.deg,
"Rho"=as.vector(cor.rho),
"pval"=as.vector(cor.pval),
"adj.pval"=p.adjust(cor.pval))
cor.table.f<-cor.table[cor.table$pval<0.05,] #just a soft threshold
#write.csv2(cor.table.f,
#file=file.path(resultsDir,paste(miRNA.id,"csv",sep=".")))
}
}
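#cor.table.f is overwritten on every iteration, so only the table of the last miRNA with targets is returned
return(cor.table.f)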
}
cor.table.f.returned<-miRNACorrel(res.miRNA=miRNA.res.hsa2,res.mRNA=mRNA.res2,data.miRNA= x_micro_backup,data.mRNA= x_rna_backup,resultsDir=resultsComb)
cor.table.f.returned
## miRNA.ID miRNA miRNA.genes.deg Rho pval adj.pval
## 14 hsa-let-7i hsa-let-7i CASP3 0.636 0.0479 0.671
## 15 hsa-let-7i hsa-let-7i GAB2 0.661 0.0376 0.564
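Because the function above keeps only the table of the last miRNA that has targets, a variant that accumulates the results of all miRNAs may be preferable. The sketch below shows one way to do this on toy data. The `targets` list is a hypothetical stand-in for the `miRNAGenes()` lookup, all values are invented, and the p-values are adjusted across all pairs at once (the original function adjusts within each miRNA), so this is an assumption-laden illustration rather than a drop-in replacement.

```r
# Toy expression matrices (rows = features, columns = 10 samples); invented values
mirna <- rbind(miR_a = 1:10,
               miR_b = c(5, 3, 8, 1, 10, 2, 7, 4, 9, 6))
mrna <- rbind(G1 = c(10, 9, 8, 7, 6, 5, 4, 3, 1, 2),
              G2 = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9),
              G3 = c(4, 6, 1, 9, 2, 8, 3, 7, 5, 10))

# Hypothetical miRNA -> target gene map (stand-in for miRNAGenes())
targets <- list(miR_a = c("G1", "G2"),
                miR_b = c("G2", "G3", "G_absent"))

correlate_pairs <- function(targets, mirna, mrna) {
  per_mirna <- lapply(names(targets), function(m) {
    genes <- intersect(targets[[m]], rownames(mrna))
    if (length(genes) == 0) return(NULL)
    stats <- lapply(genes, function(g) {
      ct <- cor.test(mrna[g, ], mirna[m, ], method = "spearman", exact = FALSE)
      data.frame(miRNA = m, gene = g,
                 Rho = unname(ct$estimate), pval = ct$p.value)
    })
    do.call(rbind, stats)
  })
  out <- do.call(rbind, per_mirna)            # one table for all miRNAs
  out$adj.holm <- p.adjust(out$pval, method = "holm")
  out$adj.BH <- p.adjust(out$pval, method = "BH")
  out
}

res <- correlate_pairs(targets, mirna, mrna)
print(res)

# Inversely correlated pairs, using the thresholds named in the comments above
inverse <- res[res$Rho < -0.5 & res$pval < 0.05, ]
print(inverse)
```

Reporting both adjustments side by side makes the difference visible: the Holm values are never smaller than the Benjamini-Hochberg (FDR) values for the same set of p-values.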
CIRCOS PLOT DEPICTING PREVIOUSLY OBTAINED CORRELATION COEFFICIENTS ALONG WITH FILTERED, TPM-NORMALIZED, LOG-TRANSFORMED mRNA-SEQ COUNTS, miRNA-SEQ COUNTS, AND ENCODED GISTIC CNV VALUES FOR SEPARATE OLD AND YOUNG PATIENT GROUPS:
options(stringsAsFactors = FALSE)
#RECALL FILTERED, NORMALIZED, LOG-TRANSFORMED mRNA-SEQ AND miRNA-SEQ MATRICES AND CNV DATAFRAME:
countsF_TPM_LOG<-log2(countsTPM[,1:10]+2)
countsF_TPM_LOG_DF<-as.data.frame(countsF_TPM_LOG)
countsF_TPM_LOG_DF$ID<-countsFInfo_backup$ID
countsF_TPM_LOG_DF$chr<-countsFInfo_backup$chr
countsF_TPM_LOG_DF$start<-countsFInfo_backup$start
countsF_TPM_LOG_DF$end<-countsFInfo_backup$end
cors.young[is.na(cors.young)] <- 0
#miRNA Expression
countsF_TPM_LOG_micro<-log2(countsTPM_micro[,1:10]+2)
countsF_TPM_LOG_DF_micro<-as.data.frame(countsF_TPM_LOG_micro)
countsF_TPM_LOG_DF_micro$ID<-countsFInfo_micro$ID
countsF_TPM_LOG_DF_micro$chr<-countsFInfo_micro$chromosome_name
countsF_TPM_LOG_DF_micro$start<-countsFInfo_micro$start_position
countsF_TPM_LOG_DF_micro$end<-countsFInfo_micro$end_position
range(assays(mRNA_expr)$"exprs")
## [1] 0 206162
table(seqnames(rowRanges(mRNA_expr)))
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 22 11 13 7 9 5 9 8 9 8 10 11 2 2 7 7
## 17 18 19 20 21 22 X Y chrM
## 13 4 16 8 2 6 6 0 0
rowRanges(mRNA_expr)
## GRanges object with 195 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## DIRAS3 1 68511645-68516481 - | 9077
## MAPK14 6 35995454-36079013 + | 1432
## YAP1 11 101981192-102104154 + | 10413
## CDKN1B 12 12870302-12875305 + | 1027
## ERBB2 17 37844393-37884915 + | 2064
## ... ... ... ... . ...
## MACC1 7 20174279-20257013 - | 346389
## CHGA 14 93389445-93401638 + | 1113
## IDH3A 15 78441719-78462884 + | 3419
## SQSTM1 5 179233388-179265077 + | 8878
## KCNJ13 2 233630512-233641275 - | 3769
## -------
## seqinfo: 25 sequences (1 circular) from 2 genomes (GRCh37.p13, hg19)
#Already a GRanges Object (No need to unlist)
#mRNA_expr.gr<-unlist(rowRanges(mRNA_expr))#from a GRangesList to a GRanges object?
range_df<-as.data.frame(rowRanges(mRNA_expr))
range_df$gene_symbol<-rownames(range_df)
T.cors.old<-data.frame("chr"=range_df$seqnames,"Start"=as.integer(range_df$start),"End"=as.integer(range_df$end),cors.old,row.names=NULL)
T.cors.young<-data.frame("chr"=range_df$seqnames,"Start"=as.integer(range_df$start),"End"=as.integer(range_df$end),cors.young,row.names=NULL)
T.CN.old<-data.frame("chr"=countsFInfo_CNV_backup$chr,"Start"=as.integer(countsFInfo_CNV_backup$start),"End"=as.integer(countsFInfo_CNV_backup$end),countsFInfo_CNV_backup[,c(4,6,8,9,11)],row.names=NULL)
T.CN.young<-data.frame("chr"=countsFInfo_CNV_backup$chr,"Start"=as.integer(countsFInfo_CNV_backup$start),"End"=as.integer(countsFInfo_CNV_backup$end),countsFInfo_CNV_backup[,c(2,3,5,7,10)],row.names=NULL)
T.mRNA.old<-data.frame("chr"=countsF_TPM_LOG_DF$chr,"Start"=as.integer(countsF_TPM_LOG_DF$start),"End"=as.integer(countsF_TPM_LOG_DF$end),countsF_TPM_LOG_DF[,c(3,5,7,8,10)],row.names=NULL)
T.mRNA.young<-data.frame("chr"=countsF_TPM_LOG_DF$chr,"Start"=as.integer(countsF_TPM_LOG_DF$start),"End"=as.integer(countsF_TPM_LOG_DF$end),countsF_TPM_LOG_DF[,c(1,2,4,6,9)],row.names=NULL)
T.miRNA.old<-data.frame("chr"=countsF_TPM_LOG_DF_micro$chr,"Start"=as.integer(countsF_TPM_LOG_DF_micro$start),"End"=as.integer(countsF_TPM_LOG_DF_micro$end),countsF_TPM_LOG_DF_micro[,c(3,5,7,8,10)],row.names=NULL)
T.miRNA.young<-data.frame("chr"=countsF_TPM_LOG_DF_micro$chr,"Start"=as.integer(countsF_TPM_LOG_DF_micro$start),"End"=as.integer(countsF_TPM_LOG_DF_micro$end),countsF_TPM_LOG_DF_micro[,c(1,2,4,6,9)],row.names=NULL)
T_labels<-data.frame("chr"=range_df$seqnames,"Start"=as.integer(range_df$start),"End"=as.integer(range_df$end),range_df$gene_symbol,row.names=NULL)
#Circos plot of the FILTERED, TPM-NORMALIZED, LOG-TRANSFORMED DATA, WITH SEPARATE TRACKS FOR THE YOUNG AND OLD PATIENT GROUPS
colors <- rainbow(10, alpha=0.5)
par(mar=c(2, 2, 2, 2))
plot(c(1,800), c(1,800), type="n", axes=FALSE, xlab="", ylab="", main="")
circos(R=300, cir="hg19", W=4, type="chr", print.chr.lab=TRUE, scale=TRUE)
circos(R=260, cir="hg19", W=40, mapping=T.miRNA.young,col.v=4,type="heatmap2", cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue")
circos(R=220, cir="hg19", W=40, mapping=T.miRNA.old,col.v=4,type="heatmap2", cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue")
circos(R=180, cir="hg19", W=40, mapping=T.mRNA.young,col.v=4,type="heatmap2", cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue")
circos(R=140, cir="hg19", W=40, mapping=T.mRNA.old,col.v=4,type="heatmap2", cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue")
circos(R=120, cir="hg19", W=20, mapping=T.CN.young, col.v=4, type="ml3", B=FALSE, lwd=1, cutoff=0)
circos(R=100, cir="hg19", W=20, mapping=T.CN.old, col.v=4, type="ml3", B=FALSE, lwd=1, cutoff=0)
circos(R=80, cir="hg19", W=20, mapping=T.cors.young, col.v=4, type="s", B=TRUE, lwd=1, col=colors[1])
circos(R=60, cir="hg19", W=20, mapping=T.cors.old, col.v=4, type="s", B=TRUE, lwd=1, col=colors[1])
#Adding labels for the genes
circos(R=310, cir="hg19", W=20, mapping=T_labels, type="label", side="out", col=c("black", "blue","red"), cex=0.4)
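The track data frames above select the old and young patients by hard-coded column positions, which silently breaks if the column order changes. A small helper that selects samples by name is one way to make this step more robust. The sketch below is an illustration under assumptions: `make_track`, `info`, `values`, and `age` are hypothetical names, the expression values are invented, and only the two gene coordinates are taken from the `rowRanges` output shown earlier. The resulting layout (chr, Start, End, then one value column per sample) mirrors the data frames passed to `circos()` with `col.v=4` in this report.

```r
# Build a track data frame (chr, Start, End, values...) for a named set of samples
make_track <- function(info, values, samples) {
  missing <- setdiff(samples, colnames(values))
  if (length(missing) > 0) {
    stop("Unknown samples: ", paste(missing, collapse = ", "))
  }
  data.frame(chr = info$chr,
             Start = as.integer(info$start),
             End = as.integer(info$end),
             values[, samples, drop = FALSE],
             row.names = NULL, check.names = FALSE)
}

# Gene coordinates as printed by rowRanges(mRNA_expr) above
info <- data.frame(ID = c("DIRAS3", "MAPK14"),
                   chr = c("1", "6"),
                   start = c(68511645, 35995454),
                   end = c(68516481, 36079013))

# Invented log-expression values for four of the patients
values <- matrix(c(3.1, 7.2, 2.8, 7.9, 3.5, 6.8, 2.2, 8.1), nrow = 2,
                 dimnames = list(info$ID, c("A5J9", "A5JF", "A5JI", "A5K0")))

# Age group per sample, as stated in the commentary of this report
age <- c(A5J9 = "young", A5JF = "old", A5JI = "young", A5K0 = "old")

track_old <- make_track(info, values, names(age)[age == "old"])
track_young <- make_track(info, values, names(age)[age == "young"])
print(track_old)
print(track_young)
```

Selecting by name also makes the helper fail loudly when a requested sample is absent, instead of plotting the wrong patient.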
MULTI-FACTOR ANALYSIS (MFA)
##########################################GLOBAL MFA ON RAW CNV, mRNA-Seq, miRNA-Seq DATA####################################
cond<-as.factor(colData(miniACC.assays.comp.age)$years_to_birth)
dat4Facto<-data.frame(cond=cond,t(mACC.exp.c3),t(mACC.CN.c3),t(mACC.mir.c3))
rownames(dat4Facto) <- gsub("TCGA-","",rownames(cd3))
#We will consider CN as scaled but it would be better to consider it as categorical
res = MFA(dat4Facto, group=c(1,exp.l3,cn.l3,mir.l3), type=c("n","c","s","c"), ncp=5, name.group=c("cond","mRNA","CNV","miRNA"),num.group.sup=c(1))
#Extra informative plots
plot(res,choix="ind",habillage = "cond")
plotellipses(res, keepvar = "cond")
#There seems to be a clear separation between old and young patients.
#Patient samples OR-A5L5 and OR-A5LC appear to be outliers and will be replaced with differently aged patients
########################################GLOBAL MFA ON FILTERED, NORMALIZED, LOG-TRANSFORMED CNV, mRNA-Seq, miRNA-Seq DATA#########################
mRNAexp.l<-nrow(countsF_TPM_LOG_DF_MFA)
cnv.l<-nrow(countsFInfo_CNV_backup_MFA)
miRNAexp.l<-nrow(countsF_TPM_LOG_DF_micro_MFA)
dat4Facto2<-data.frame(cond=as.factor(cond2),countsF_TPM_LOG_DF_MFA.t,countsFInfo_CNV_backup_MFA.t,countsF_TPM_LOG_DF_micro_MFA.t)
dim(dat4Facto2)
## [1] 10 670
#We will consider CN as scaled but it would be better to consider it as categorical
es2 = MFA(dat4Facto2, group=c(1,mRNAexp.l,cnv.l,miRNAexp.l), type=c("n","c","s","c"), ncp=5,name.group=c("cond2","mRNAexp","cnv","miRNAexp"),num.group.sup=c(1))
top10.1 <- sort(es2$global.pca$var$cor[,"Dim.1"],decreasing=TRUE)[1:10]
top10.1
## SMAD1.mRNAexp SRC.mRNAexp PIK3R1.mRNAexp PRKAA1.mRNAexp AKT3.mRNAexp
## 0.887 0.839 0.839 0.831 0.828
## NFKB1.mRNAexp MAPK9.mRNAexp AKT1.mRNAexp PRKCA.mRNAexp SQSTM1.mRNAexp
## 0.823 0.817 0.815 0.803 0.802
top10.2 <- sort(es2$global.pca$var$cor[,"Dim.2"],decreasing=TRUE)[1:10]
top10.2
## SRC.cnv TGM2.cnv E2F1.cnv NCOA3.cnv BCL2L1.cnv PRKAA1.cnv YWHAB.cnv
## 0.930 0.930 0.930 0.930 0.930 0.930 0.930
## PREX1.cnv CDKN1B.cnv ERBB3.cnv
## 0.930 0.926 0.926
top10.3 <- sort(es2$global.pca$var$cor[,"Dim.3"],decreasing=TRUE)[1:10]
top10.3
## hsa.mir.196a.2.miRNAexp hsa.mir.106b.miRNAexp hsa.mir.196a.1.miRNAexp
## 0.886 0.875 0.864
## hsa.mir.25.miRNAexp CDK1.mRNAexp hsa.mir.16.2.miRNAexp
## 0.848 0.793 0.793
## hsa.mir.196b.miRNAexp hsa.mir.92a.2.miRNAexp FOXM1.mRNAexp
## 0.790 0.785 0.776
## ACACB.mRNAexp
## 0.775
#Extra informative plots
plot(es2,choix="ind",habillage = "cond")
plotellipses(es2, keepvar = "cond")
fviz_mfa_ind(es2, label = "var", habillage = cond2, addEllipses = TRUE, ellipse.level = 0.95)
fviz_contrib(es2, choice = "quanti.var", axes = 1)
summary(es2)
##
## Call:
## MFA(base = dat4Facto2, group = c(1, mRNAexp.l, cnv.l, miRNAexp.l),
## type = c("n", "c", "s", "c"), ncp = 5, name.group = c("cond2",
## "mRNAexp", "cnv", "miRNAexp"), num.group.sup = c(1))
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 2.137 1.634 1.236 1.070 0.781 0.656 0.488
## % of var. 24.655 18.854 14.268 12.347 9.014 7.569 5.630
## Cumulative % of var. 24.655 43.509 57.777 70.123 79.138 86.707 92.336
## Dim.8 Dim.9
## Variance 0.430 0.235
## % of var. 4.957 2.706
## Cumulative % of var. 97.294 100.000
##
## Groups
## Dim.1 ctr cos2 Dim.2 ctr cos2
## mRNAexp | 0.848 39.705 0.359 | 0.615 37.652 0.189 |
## cnv | 0.454 21.258 0.098 | 0.920 56.312 0.402 |
## miRNAexp | 0.834 39.036 0.590 | 0.099 6.035 0.008 |
## Dim.3 ctr cos2
## mRNAexp 0.523 42.304 0.136 |
## cnv 0.434 35.121 0.090 |
## miRNAexp 0.279 22.575 0.066 |
##
## Supplementary group
## Dim.1 cos2 Dim.2 cos2 Dim.3 cos2
## cond2 | 0.193 0.037 | 0.045 0.002 | 0.088 0.008 |
##
## Individuals
## Dim.1 ctr cos2 Dim.2 ctr cos2
## TCGA.OR.A5J9.01A.11D.A29H.01 | 1.261 7.437 0.197 | 1.240 9.404 0.191 |
## TCGA.OR.A5JE.01A.11D.A29H.01 | -1.916 17.186 0.417 | 0.533 1.737 0.032 |
## TCGA.OR.A5JF.01A.11D.A29H.01 | 1.507 10.635 0.381 | 0.803 3.946 0.108 |
## TCGA.OR.A5JI.01A.11D.A29H.01 | 1.077 5.430 0.157 | 1.270 9.868 0.218 |
## TCGA.OR.A5K0.01A.11D.A29H.01 | 0.705 2.325 0.046 | -2.017 24.901 0.374 |
## TCGA.OR.A5KV.01A.11D.A29H.01 | -1.398 9.145 0.248 | -0.491 1.474 0.031 |
## TCGA.OR.A5L5.01A.11D.A29H.01 | 0.439 0.902 0.038 | 0.808 3.996 0.127 |
## TCGA.OR.A5LC.01A.11D.A29H.01 | -1.286 7.745 0.158 | 1.172 8.405 0.131 |
## TCGA.OR.A5LE.01A.11D.A29H.01 | -2.231 23.300 0.512 | -1.198 8.788 0.148 |
## TCGA.OR.A5LL.01A.11D.A29H.01 | 1.843 15.895 0.274 | -2.119 27.479 0.363 |
## Dim.3 ctr cos2
## TCGA.OR.A5J9.01A.11D.A29H.01 1.105 9.873 0.152 |
## TCGA.OR.A5JE.01A.11D.A29H.01 -0.413 1.381 0.019 |
## TCGA.OR.A5JF.01A.11D.A29H.01 0.222 0.400 0.008 |
## TCGA.OR.A5JI.01A.11D.A29H.01 -1.051 8.941 0.149 |
## TCGA.OR.A5K0.01A.11D.A29H.01 1.211 11.853 0.135 |
## TCGA.OR.A5KV.01A.11D.A29H.01 -1.566 19.845 0.312 |
## TCGA.OR.A5L5.01A.11D.A29H.01 -1.357 14.894 0.359 |
## TCGA.OR.A5LC.01A.11D.A29H.01 1.959 31.026 0.367 |
## TCGA.OR.A5LE.01A.11D.A29H.01 0.274 0.606 0.008 |
## TCGA.OR.A5LL.01A.11D.A29H.01 -0.382 1.181 0.012 |
##
## Continuous variables (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2
## DIRAS3.mRNAexp | -0.309 0.069 0.017 | 1.859 3.251 0.604 |
## MAPK14.mRNAexp | 0.438 0.138 0.294 | -0.309 0.090 0.147 |
## YAP1.mRNAexp | 0.468 0.158 0.286 | -0.597 0.335 0.465 |
## CDKN1B.mRNAexp | 0.357 0.092 0.254 | -0.403 0.152 0.324 |
## ERBB2.mRNAexp | 0.515 0.191 0.245 | -0.169 0.027 0.026 |
## G6PD.mRNAexp | 0.383 0.105 0.086 | 1.026 0.992 0.620 |
## KDR.mRNAexp | 0.557 0.223 0.110 | 1.281 1.543 0.579 |
## AKT1S1.mRNAexp | 0.007 0.000 0.000 | 0.165 0.026 0.055 |
## MAPK8.mRNAexp | 0.563 0.228 0.399 | -0.337 0.107 0.143 |
## PRKCD.mRNAexp | 0.015 0.000 0.000 | 0.138 0.018 0.030 |
## Dim.3 ctr cos2
## DIRAS3.mRNAexp -0.444 0.245 0.034 |
## MAPK14.mRNAexp 0.347 0.150 0.184 |
## YAP1.mRNAexp -0.265 0.087 0.091 |
## CDKN1B.mRNAexp 0.281 0.098 0.158 |
## ERBB2.mRNAexp -0.230 0.066 0.049 |
## G6PD.mRNAexp 0.147 0.027 0.013 |
## KDR.mRNAexp -0.194 0.047 0.013 |
## AKT1S1.mRNAexp 0.177 0.039 0.063 |
## MAPK8.mRNAexp 0.144 0.026 0.026 |
## PRKCD.mRNAexp 0.057 0.004 0.005 |
##
## Supplementary categories
## Dim.1 cos2 v.test Dim.2 cos2 v.test
## old | 0.642 0.434 1.317 | -0.271 0.077 -0.635 |
## young | -0.642 0.434 -1.317 | 0.271 0.077 0.635 |
## Dim.3 cos2 v.test
## old 0.330 0.115 0.892 |
## young -0.330 0.115 -0.892 |
#Overall, MFA helps to understand the underlying structure of the data by reducing its dimensionality and highlighting the relationships between variables and observations.
#Based on the MFA summary eigenvalues, the first three dimensions capture 57.78% (24.66% (dim1) + 18.85% (dim2) + 14.27% (dim3)) of the total variance.
#Based on the MFA summary group analysis, the mRNA-seq and miRNA-seq blocks contribute the most, and almost equally,
#to the first dimension (ctr 39.7% and 39.0% vs. 21.3% for GISTIC CNV), while the GISTIC CNV recurrent lesions contribute the most to
#dimension#2 (coordinate 0.920 vs. 0.099 for miRNA-seq). The top variables correlated with dimension#1 (all from the mRNA-seq data block) are
#SMAD1, SRC, PIK3R1, PRKAA1, AKT3, NFKB1, MAPK9, AKT1, PRKCA, and SQSTM1.
#The top variables correlated with dimension#2 (all from the GISTIC CNV gene-based recurrent lesions data block) are SRC,
#TGM2, E2F1, NCOA3, BCL2L1, PRKAA1, YWHAB, PREX1, CDKN1B, and ERBB3. The top variables correlated with dimension#3 (from the miRNA-seq data block)
#are hsa.mir.196a.2, hsa.mir.106b, hsa.mir.196a.1, hsa.mir.25, hsa.mir.16.2, hsa.mir.196b, hsa.mir.92a.2,
#and (from the mRNA-seq data block) CDK1, FOXM1, and ACACB.
#Based on the MFA, there is a clear separation between the CNV, mRNA, and miRNA data blocks.
#The individuals analysis shows the position of each of the ten patients on each dimension.
#No clear segregation between young and old patient samples is apparent. Of the ten selected patient samples,
#A5J9 (young), A5JF (old), A5JI (young), A5K0 (old), A5L5 (old), and A5LL (old) have positive coordinates on dimension#1, while
#A5JE (young), A5KV (young), A5LC (old), and A5LE (young) have negative coordinates on dimension#1.
#Young patients TCGA.OR.A5LE, A5J9, and A5JE appear to be outliers. Old patients A5K0, A5LL, A5JF, and A5LC appear to be outliers, suggesting that the
#10 patients selected were not appropriate. The mRNA expression dimension seems to coincide with the age.status condition
#more than the other 2 data blocks.
#Based on the MFA continuous variables analysis, which indicates the relationship between the original variables
#and the extracted dimensions, the mRNA-seq genes are the variables most strongly correlated with Dimension 1, ahead of the miRNA-seq and GISTIC CNV variables.
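The percentages quoted in this interpretation can be re-derived from the numbers printed by `summary(es2)`. The sketch below recomputes the percentage of variance per dimension from the eigenvalues and converts the v.test values of the supplementary "old" category into approximate two-sided p-values, assuming the usual reading of a v.test as a standard normal deviate (values above 1.96 in absolute value are conventionally taken as significant). The numbers are copied from the summary output; the variable names are illustrative only.

```r
# Eigenvalues ("Variance" row) copied from summary(es2)
variance <- c(2.137, 1.634, 1.236, 1.070, 0.781, 0.656, 0.488, 0.430, 0.235)
pct <- 100 * variance / sum(variance)
cum_pct <- cumsum(pct)
print(round(rbind(pct = pct[1:3], cum_pct = cum_pct[1:3]), 2))

# v.test of the supplementary category "old" on the first three dimensions
vtest <- c(Dim.1 = 1.317, Dim.2 = -0.635, Dim.3 = 0.892)

# Approximate two-sided p-values under a standard normal reference
p_two_sided <- 2 * pnorm(-abs(vtest))
print(round(p_two_sided, 3))
```

None of the three v.test values exceeds 1.96 in absolute value, which agrees with the observation that no clear segregation between young and old patients is apparent on these dimensions.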